Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics
Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Francesco Croce
TL;DR
This work addresses the challenge of aligning automated perceptual similarity metrics with human judgments across uni- and multi-modal inputs. It introduces UniSim-Bench, a benchmark integrating 7 perceptual tasks over 25 datasets, and demonstrates that specialized metrics often fail to generalize to unseen tasks while general-purpose models offer broader robustness. To move toward a unified solution, the authors present UniSim, a family of multi-task perceptual metrics trained in a unified fashion: the CLIP-based UniSim and the LMM-based UniSim-LL-N, trained with a hinge loss on balanced 2AFC data and leveraging LoRA for efficient fine-tuning. Results show that UniSim achieves strong average performance and task transfer, though true generalization to diverse unseen tasks remains challenging, underscoring the need for further research into robust, human-aligned multi-modal similarity metrics. The work provides public code and models, laying the groundwork for broader evaluation and development of unified perceptual metrics.
Abstract
Human perception of similarity across uni- and multi-modal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. General-purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics, and several recent works have developed models specialized in narrow perceptual tasks. However, the extent to which existing perceptual metrics align with human perception remains unclear. To investigate this question, we introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks, with a total of 25 datasets. Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks. Conversely, metrics fine-tuned for specific tasks fail to generalize well to unseen, though related, tasks. As a first step towards a unified multi-task perceptual similarity metric, we fine-tune both encoder-based and generative vision-language models on a subset of the UniSim-Bench tasks. This approach yields the highest average performance, and in some cases, even surpasses task-specific models. Nevertheless, these models still struggle with generalization to unseen tasks, highlighting the ongoing challenge of learning a robust, unified perceptual similarity metric capable of capturing the human notion of similarity. The code and models are available at https://github.com/SaraGhazanfari/UniSim.
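To make the evaluation setting concrete, the following is a minimal sketch of how an embedding model could be used as a zero-shot perceptual metric on a two-alternative forced choice (2AFC) item, together with a hinge loss of the kind mentioned in the TL;DR. It is an illustration under stated assumptions, not the authors' implementation: the embeddings are plain Python lists standing in for encoder outputs, the function names (cosine, two_afc_choice, hinge_loss, two_afc_accuracy) are hypothetical, and both the margin value and the exact form of the loss are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_afc_choice(ref, cand0, cand1):
    """Return the index (0 or 1) of the candidate closer to the reference."""
    return 0 if cosine(ref, cand0) >= cosine(ref, cand1) else 1


def hinge_loss(ref, cand0, cand1, label, margin=0.05):
    """Hinge loss on the similarity gap (margin value is an assumption).

    `label` is the human-preferred candidate (0 or 1). The loss is zero once
    the preferred candidate is more similar to the reference by at least
    `margin`, and grows linearly otherwise.
    """
    gap = cosine(ref, cand1) - cosine(ref, cand0)
    sign = 1.0 if label == 1 else -1.0
    return max(0.0, margin - sign * gap)


def two_afc_accuracy(triplets, labels):
    """Fraction of (ref, cand0, cand1) triplets where the metric agrees with humans."""
    hits = sum(
        two_afc_choice(ref, c0, c1) == y
        for (ref, c0, c1), y in zip(triplets, labels)
    )
    return hits / len(labels)


# Toy embeddings: cand0 is nearly parallel to the reference, cand1 is orthogonal.
ref, cand0, cand1 = [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]
choice = two_afc_choice(ref, cand0, cand1)
accuracy = two_afc_accuracy([(ref, cand0, cand1)], [0])
loss_agree = hinge_loss(ref, cand0, cand1, label=0)
loss_disagree = hinge_loss(ref, cand0, cand1, label=1)
```

In this toy case the metric picks the first candidate, so the loss vanishes when the human label is 0 and is positive when the label is 1. With a real encoder, the lists would be replaced by image or text embeddings, and the same comparison would be repeated over every item of a 2AFC dataset.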
