What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

Anthony Meng Huat Tiong, Junqi Zhao, Boyang Li, Junnan Li, Steven C. H. Hoi, Caiming Xiong

TL;DR

The paper addresses the challenge of evaluating large vision-language models by uncovering latent capabilities and biases in VL benchmarks. It proposes a data-driven transfer-analysis pipeline that normalizes cross-task performance, uses SVD to embed target tasks, and applies Exploratory Factor Analysis to reveal six latent VL skill factors, including a surprising length effect. It introduces OLIVE, a diverse open-world instruction dataset, which exhibits a transfer profile distinct from existing datasets and highlights areas not captured by the discovered factors. The findings inform the design of balanced, broad-coverage VL test suites and encourage data-driven task grouping rather than intuition-based categorization.
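The first two steps of the pipeline (normalizing cross-task performance and embedding target tasks with SVD) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `embed_targets` and `cosine_similarity`, the layout of the score matrix (rows are source tasks, columns are target tasks), the choice of per-column z-scoring as the normalization, and the number of retained dimensions `k`. The Exploratory Factor Analysis step is not shown.

```python
import numpy as np


def embed_targets(perf, k=2):
    """Embed target tasks from a (n_sources, n_targets) score matrix.

    Hypothetical helper: z-scoring each target column is an assumed
    normalization, not necessarily the one used in the paper.
    """
    perf = np.asarray(perf, dtype=float)
    mu = perf.mean(axis=0, keepdims=True)
    sd = perf.std(axis=0, keepdims=True)
    sd[sd == 0] = 1.0  # avoid division by zero for constant columns
    z = (perf - mu) / sd

    # Thin SVD: z = U @ diag(s) @ Vt. Rows of Vt.T correspond to target tasks.
    _, s, vt = np.linalg.svd(z, full_matrices=False)

    # Keep the top-k right singular directions, scaled by singular values.
    return vt[:k].T * s[:k]


def cosine_similarity(emb):
    """Pairwise cosine similarity between the rows of emb."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zero vectors
    unit = emb / norms
    return unit @ unit.T


if __name__ == "__main__":
    # Synthetic scores purely for demonstration: 6 source tasks, 4 target tasks.
    rng = np.random.default_rng(0)
    scores = rng.random((6, 4))
    embeddings = embed_targets(scores, k=2)
    print(cosine_similarity(embeddings).round(2))
```

Under these assumptions, target tasks whose scores respond similarly across source tasks end up with a cosine similarity close to 1, which is the kind of relationship a task-similarity heatmap would visualize.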

Abstract

Vision-language (VL) models, pretrained on colossal image-text datasets, have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. First, generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting benchmarks could leverage similar analyses for task selection. Finally, we present a new dataset, OLIVE (https://github.com/jq-zh/olive-dataset), which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods.

Paper Structure (24 sections, 6 equations, 5 figures, 18 tables)

Figures (5)

  • Figure 1: Examples of the OLIVE benchmark for different categories. From left to right: visual recognition, knowledge-based QA, and creative writing.
  • Figure 2: Cosine similarity between target tasks computed using SVD features.
  • Figure 3: Results of EFA on the residuals $\bar{A}$. Black arrows indicate positive loadings; red arrows indicate negative loadings. Cut-off for factor loadings = 0.3.
  • Figure 4: EFA results when we extract 3 factors from the 7 generative VQA tasks and the 7 MC VQA tasks separately. We merge the results for display. Cut-off for factor loadings = 0.6.
  • Figure 5: Hierarchical clustering of target tasks.