Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Francesco Croce

TL;DR

This work addresses the challenge of aligning automated perceptual similarity metrics with human judgments across uni- and multi-modal inputs. It introduces UniSim-Bench, a benchmark integrating 7 perceptual tasks over 25 datasets, and demonstrates that specialized metrics often fail to generalize to unseen tasks while general-purpose models offer broader robustness. To move toward a unified solution, the authors present UniSim, a family of multi-task perceptual metrics trained in a unified fashion: CLIP-based UniSim and UniSim-LL-N (LMM-based), trained with a hinge loss on balanced 2AFC data and leveraging LoRA for efficient fine-tuning. Results show UniSim achieves strong average performance and task transfer, though true generalization to diverse unseen tasks remains challenging, underscoring the need for further research into robust, human-aligned multi-modal similarity metrics. The work provides public code and models, laying groundwork for broader evaluation and development of unified perceptual metrics.
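As a rough illustration of the training signal mentioned above, the following minimal sketch shows one way a hinge loss over a two-alternative forced choice (2AFC) triplet could be computed from embeddings. The source does not give the exact formulation, so the cosine-similarity scoring, the margin value of 0.05, the label convention, and the function names `cosine`, `two_afc_hinge_loss`, and `two_afc_accuracy` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def two_afc_hinge_loss(ref, alt0, alt1, label, margin=0.05):
    """Hinge loss for one 2AFC triplet (illustrative assumption).

    label is 0 if annotators preferred alt0 and 1 if they preferred alt1.
    The margin of 0.05 is a hypothetical value, not taken from the paper.
    """
    s0 = cosine(ref, alt0)
    s1 = cosine(ref, alt1)
    # Map the label to a sign so the gap is positive when the model
    # ranks the human-preferred alternative higher.
    sign = 1.0 if label == 1 else -1.0
    gap = sign * (s1 - s0)
    # No penalty once the preferred alternative wins by at least the margin.
    return max(0.0, margin - gap)


def two_afc_accuracy(triplets):
    """Fraction of triplets where the higher-similarity alternative
    matches the human choice."""
    correct = 0
    for ref, alt0, alt1, label in triplets:
        predicted = 1 if cosine(ref, alt1) > cosine(ref, alt0) else 0
        correct += int(predicted == label)
    return correct / len(triplets)
```

With toy two-dimensional embeddings, a triplet whose reference coincides with the preferred alternative incurs zero loss, while the same triplet with the opposite label is penalized by the full similarity gap plus the margin.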

Abstract

Human perception of similarity across uni- and multi-modal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. General-purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics, and several recent works have developed models specialized in narrow perceptual tasks. However, the extent to which existing perceptual metrics align with human perception remains unclear. To investigate this question, we introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks, with a total of 25 datasets. Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks. Conversely, metrics fine-tuned for specific tasks fail to generalize well to unseen, though related, tasks. As a first step towards a unified multi-task perceptual similarity metric, we fine-tune both encoder-based and generative vision-language models on a subset of the UniSim-Bench tasks. This approach yields the highest average performance and, in some cases, even surpasses task-specific models. Nevertheless, these models still struggle with generalization to unseen tasks, highlighting the ongoing challenge of learning a robust, unified perceptual similarity metric capable of capturing the human notion of similarity. The code and models are available at https://github.com/SaraGhazanfari/UniSim.

Paper Structure

This paper contains 36 sections, 2 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Summary of our UniSim framework. First, we frame the existing multi-modal perceptual similarity tasks into our unified benchmark UniSim-Bench (from which the Core 2AFC Tasks are illustrated in the top row). Second, we show that models specialized in individual tasks (e.g., DreamSim [fu2023learning], HPSv2 [wu2023human], PAC-S [sarto2023positive], LIQE [zhang2023liqe]) do not generalize well to unseen perceptual tasks, with even worse accuracy than CLIP [radford2021clip]. Finally, we introduce our multi-task perceptual metric, UniSim, which surpasses the baseline CLIP model and demonstrates superior or competitive performance compared with the specialized models, as depicted in the radar plots.
  • Figure 2: OOD Generalization Tasks in UniSim-Bench. We illustrate samples from the three tasks that are not used for training but are held out to evaluate the model's generalization capabilities.
  • Figure 3: Increasing the number of alternatives in the Image-to-Text Alignment task. We report accuracy as the number of alternative images increases in the IT-2AFC task (HPDv2 dataset). Both UniSim models preserve higher accuracy than their respective baselines (the gap is highlighted in the plot) as the number of alternatives grows. Notably, our encoder-based UniSim ViT-L/14 significantly outperforms the other metrics, including the LMM-based UniSim.
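
The evaluation described for Figure 3, where the number of alternatives grows beyond two, can be sketched as a generic N-alternative accuracy. This is only an illustrative sketch: the sample layout `(query, alternatives, preferred_index)`, the function name `n_afc_accuracy`, and the scalar `toy_score` stand-in for a real perceptual metric are hypothetical and not taken from the paper or its code.

```python
def n_afc_accuracy(samples, score):
    """Accuracy of picking the human-preferred alternative among N.

    samples: list of (query, alternatives, preferred_index) tuples,
             where the number of alternatives may vary per sample.
    score:   callable (query, alternative) -> float, higher means
             better aligned; any perceptual metric could be plugged in.
    """
    correct = 0
    for query, alternatives, preferred in samples:
        scores = [score(query, alt) for alt in alternatives]
        # Predict the alternative with the highest score.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == preferred)
    return correct / len(samples)


def toy_score(query, alt):
    """Hypothetical stand-in metric on scalars: closer values score higher."""
    return -abs(query - alt)


# Toy data with 2, 3, and 4 alternatives per query.
samples = [
    (0.0, [0.1, 0.9], 0),
    (0.0, [0.5, 0.9, 0.2], 2),
    (1.0, [0.1, 0.4, 0.6, 0.3], 1),
]
accuracy = n_afc_accuracy(samples, toy_score)
```

Since the chance level for a sample with N alternatives is 1/N, the task becomes harder as N grows, which is why accuracy retained at larger N is informative.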