Ranked from Within: Ranking Large Multimodal Models Without Labels

Weijie Tu, Weijian Deng, Dylan Campbell, Yu Yao, Jiyang Zheng, Tom Gedeon, Tongliang Liu

TL;DR

The paper tackles the challenge of ranking large multimodal models (LMMs) without target-domain labels. It proposes unsupervised ranking signals derived from LMM outputs (primarily token-level softmax uncertainty, self-consistent generation, and labeled proxy data) and evaluates them across nine multimodal benchmarks. Uncertainty based on negative log-likelihood (NLL) proves to be the most stable and predictive signal regardless of task (MCVQ or VQA), while the effect of token-position choice on ranking depends on the task type. Sampling-based approaches offer a label-free alternative, especially for API-based models that do not expose token probabilities. Because cross-domain performance correlations are generally weak, performance on one benchmark is an unreliable guide to another, which motivates uncertainty-driven ranking and calibration. The findings enable practical model selection in label-scarce settings and across diverse target domains. The work lays a foundation for robust, label-free evaluation of LMMs and suggests several directions for improving ranking: test-time augmentation, semantic entropy, and exploration of internal model states.
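The sampling-based signal mentioned above can be illustrated with a minimal sketch. It assumes that each model is sampled several times per question and that agreement among the samples serves as a confidence proxy; the function names, the string normalization, and the toy answers are hypothetical and are not taken from the paper.

```python
from collections import Counter


def consistency_score(sampled_answers):
    """Fraction of sampled answers that match the most common answer."""
    # Assumption: lowercasing and stripping is enough to compare answers.
    normalized = [a.strip().lower() for a in sampled_answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def model_consistency(samples_per_question):
    """Average the per-question agreement over an unlabeled dataset."""
    scores = [consistency_score(s) for s in samples_per_question]
    return sum(scores) / len(scores)


# Hypothetical sampled answers: two questions, three samples each.
samples = {
    "model_a": [["cat", "cat", "Cat"], ["red", "red", "blue"]],
    "model_b": [["cat", "dog", "bird"], ["red", "green", "blue"]],
}

# Rank models from most to least self-consistent; no labels are used.
ranking = sorted(samples, key=lambda m: model_consistency(samples[m]), reverse=True)
print(ranking)
```

Only generated text is needed here, which is why a signal of this kind remains usable when a model is reachable only through an API.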

Abstract

Can the relative performance of a pre-trained large multimodal model (LMM) be predicted without access to labels? As LMMs proliferate, it becomes increasingly important to develop efficient ways to choose between them when faced with new data or tasks. The usual approach does the equivalent of giving the models an exam and marking them. We opt to avoid marking and the associated labor of determining the ground-truth answers. Instead, we explore other signals elicited from the models and ascertain how well the models know their own limits, evaluating the effectiveness of these signals at unsupervised model ranking. We evaluate 47 state-of-the-art LMMs (e.g., LLaVA) across 9 visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
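As a rough illustration of the idea in the abstract, the sketch below scores each model by the mean negative log-likelihood (NLL) of its generated tokens on unlabeled inputs, then checks how well that score tracks accuracy using Spearman's rank correlation. The probabilities and accuracies are invented for illustration, the helper names are hypothetical, and the paper's exact token selection and aggregation may differ.

```python
import math


def mean_nll(chosen_token_probs):
    """Mean negative log-likelihood of the tokens a model generated."""
    return -sum(math.log(p) for p in chosen_token_probs) / len(chosen_token_probs)


def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical softmax probabilities of each model's generated tokens.
probs = {
    "model_a": [0.9, 0.8, 0.95],
    "model_b": [0.6, 0.7, 0.5],
    "model_c": [0.4, 0.3, 0.5],
}
# Hypothetical accuracies, used only to check the label-free ranking.
accuracy = {"model_a": 0.82, "model_b": 0.67, "model_c": 0.51}

names = sorted(probs)
confidence = [-mean_nll(probs[m]) for m in names]  # higher = more confident
rho = spearman(confidence, [accuracy[m] for m in names])
print(f"Spearman rho = {rho:.2f}")
```

In practice the accuracies would be unknown at selection time; they appear here only to show how a proxy score is validated against ground truth on a benchmark where labels exist.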

Paper Structure

This paper contains 39 sections, 3 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Overview of unsupervised ranking for LMMs. In scenarios where labeled data is scarce, selecting the best-performing model can be challenging. Our approach introduces a label-free proxy ranking score designed to reflect true performance, achieving a high correlation ($\rho = 0.92$) with actual metrics. This enables unsupervised comparison of LMMs, allowing users to identify the most suitable model without needing labeled data.
  • Figure 2: Correlation analysis of model performance across benchmarks. (a) Scatter plots illustrating the Spearman's rank correlation coefficients ($\rho$) between performance on selected benchmarks, indicating how well performance on one benchmark predicts performance on another. Each point represents a model. The straight lines are fit with robust linear regression (Huber, 2011). (b) Heatmap of the correlation matrix for performance across eight benchmarks, with color intensity representing the strength of correlation. Higher correlations (closer to $1$) appear in red, while weaker correlations approach blue. The varying correlation strength indicates that using performance on one benchmark to rank LMMs in a target deployment environment may be inconsistent or unreliable.
  • Figure 3: An example of running one LMM for a VQA task. We also present different token positions, methods to compute token-level uncertainty and the generation of stochastic predictions.
  • Figure 4: Correlation Analysis of Fréchet Distances and Model Performance Correlation Across Datasets. Orange stars indicate the dataset pairs with the highest similarity for each dataset. Observations reveal that variations in text prompt similarity are more closely aligned with changes in performance correlation than variations in image feature similarity.
  • Figure 5: Two t-SNE plots are presented: one using image features (Top) and the other using text features (Bottom) of the datasets. We observe that the text features of OCRVQA are more scattered and significantly distant from those of other datasets.
  • ...and 11 more figures