Ranked from Within: Ranking Large Multimodal Models Without Labels
Weijie Tu, Weijian Deng, Dylan Campbell, Yu Yao, Jiyang Zheng, Tom Gedeon, Tongliang Liu
TL;DR
The paper tackles the challenge of ranking large multimodal models (LMMs) without target-domain labels. It studies ranking signals that require no target-domain annotation (primarily token-level softmax uncertainty, self-consistent generation, and labeled proxy data) and evaluates them across nine multimodal benchmarks. Negative log-likelihood (NLL) based uncertainty proves to be the most stable and predictive regardless of task (MCVQ or VQA), while the choice of token position affects ranking quality depending on task type. Sampling-based approaches offer a label-free alternative, especially for API-based models. The findings enable practical model selection in label-scarce settings and across diverse target domains. Cross-domain correlations are generally weak, which underscores the need for uncertainty-driven ranking and calibration. The work lays a foundation for robust, label-free evaluation of LMMs and suggests several directions for improving ranking: test-time augmentation, semantic entropy, and exploration of internal model states.
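The sampling-based signal mentioned above can be illustrated with a minimal sketch. It assumes that self-consistency is measured as the share of sampled answers that agree with the most frequent answer; this is one simple instantiation, not necessarily the measure used in the paper. The function names and toy answers are hypothetical.

```python
from collections import Counter

def agreement_rate(samples):
    """Fraction of sampled answers that match the most frequent answer."""
    # Light normalization so that "A" and " a" count as the same answer.
    counts = Counter(s.strip().lower() for s in samples)
    return counts.most_common(1)[0][1] / len(samples)

def consistency_score(samples_per_question):
    """Mean agreement rate over all questions (higher = more self-consistent)."""
    rates = [agreement_rate(samples) for samples in samples_per_question]
    return sum(rates) / len(rates)

# Hypothetical toy data: four sampled answers per question, two questions per model.
sampled_answers = {
    "model_x": [["A", "A", "A", "B"], ["cat", "cat", "cat", "cat"]],
    "model_y": [["A", "B", "C", "A"], ["cat", "dog", "cat", "bird"]],
}

consistency = {name: consistency_score(s) for name, s in sampled_answers.items()}
# Models with higher consistency are predicted to perform better.
consistency_ranking = sorted(consistency, key=consistency.get, reverse=True)
```

Because this score needs only sampled text, it can be computed for models that are reachable only through an API.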
Abstract
Can the relative performance of a pre-trained large multimodal model (LMM) be predicted without access to labels? As LMMs proliferate, it becomes increasingly important to develop efficient ways to choose between them when faced with new data or tasks. The usual approach does the equivalent of giving the models an exam and marking them. We opt to avoid marking and the associated labor of determining the ground-truth answers. Instead, we explore other signals elicited from the models that indicate how well they know their own limits, and we evaluate the effectiveness of these signals at unsupervised model ranking. We evaluate 47 state-of-the-art LMMs (e.g., LLaVA) across 9 visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
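As a rough illustration of ranking by softmax-derived uncertainty, the sketch below scores each model by the mean negative log-likelihood (NLL) of the tokens it generated, then checks the ranking against accuracy with a Spearman correlation. The token probabilities, accuracies, and function names are hypothetical, and the choice to average over all generated tokens is an assumption made for illustration.

```python
import math

def mean_nll(token_probs_per_answer):
    """Average negative log-likelihood over every generated token."""
    nlls = [-math.log(p) for answer in token_probs_per_answer for p in answer]
    return sum(nlls) / len(nlls)

def ranks(values):
    """Rank positions (0 = smallest). Ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical toy data: softmax probability of each generated token, per answer.
token_probs = {
    "model_a": [[0.9, 0.8], [0.95]],
    "model_b": [[0.5, 0.6], [0.4]],
    "model_c": [[0.7, 0.9], [0.6]],
}
# Hypothetical accuracies, used only to evaluate the label-free ranking.
accuracy = {"model_a": 0.8, "model_b": 0.5, "model_c": 0.7}

nll = {name: mean_nll(probs) for name, probs in token_probs.items()}
# Lower NLL means higher confidence, so the predicted best model comes first.
nll_ranking = sorted(nll, key=nll.get)

names = list(token_probs)
rho = spearman([-nll[n] for n in names], [accuracy[n] for n in names])
```

In practice the labels are unavailable on the target data, so the correlation step applies only when validating the signal on benchmarks that do have ground truth.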
