Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie

TL;DR

The paper shows that multimodal benchmarks are vulnerable to non-visual shortcuts that inflate scores without genuine visual understanding. It introduces the Test-set Stress-Test (TsT), a diagnostic framework with two variants (one LLM-based, one Random Forest-based) that trains on the non-visual inputs of the test set to quantify exploitability and produce per-sample bias scores $s(x)$, and Iterative Bias Pruning (IBP), which debiases benchmarks by removing highly biased samples. Across VSI-Bench, CV-Bench, MMMU, and VideoMME, TsT reveals pervasive shortcuts, and VSI-Bench-Debiased demonstrates that removing biased samples yields a larger vision–blind gap and more robust evaluation. The work argues for systematic, adversarial benchmark refinement so that scores reflect genuine multimodal understanding rather than statistical pattern matching, and offers practical guidance on diagnosing and mitigating biases.

Abstract

Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set": probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via $k$-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score $s(x)$. We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
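
To make the debiasing step concrete, here is a minimal sketch of a score-then-prune loop in the spirit of IBP. The abstract only states that IBP filters high-bias samples iteratively, so everything specific here is an assumption: the function names `bias_scores` and `iterative_bias_pruning`, the `batch_size`, `threshold`, and `max_removed` parameters, the choice to recompute scores after every removal round, and the toy scorer (answer frequency within a question type), which stands in for the paper's TsT-derived $s(x)$.

```python
from collections import Counter


def bias_scores(samples):
    """Hypothetical stand-in for s(x): for each (question_type, answer) pair,
    the fraction of *other* samples of the same type that share its answer."""
    pair_counts = Counter(samples)
    type_totals = Counter(qtype for qtype, _ in samples)
    scores = []
    for qtype, answer in samples:
        others = type_totals[qtype] - 1
        same = pair_counts[(qtype, answer)] - 1
        scores.append(same / others if others > 0 else 0.0)
    return scores


def iterative_bias_pruning(samples, batch_size=2, threshold=0.5, max_removed=None):
    """Repeatedly rescore the remaining samples and drop the highest-scoring
    batch, stopping when no score exceeds `threshold` or the budget is spent.
    All parameters are illustrative assumptions, not values from the paper."""
    kept = list(samples)
    removed = []
    budget = len(samples) if max_removed is None else max_removed
    while kept and len(removed) < budget:
        scores = bias_scores(kept)  # rescore: removals shift the statistics
        ranked = sorted(range(len(kept)), key=lambda i: scores[i], reverse=True)
        take = [i for i in ranked[:batch_size] if scores[i] > threshold]
        take = take[: budget - len(removed)]
        if not take:
            break
        drop = set(take)
        removed.extend(kept[i] for i in take)
        kept = [s for i, s in enumerate(kept) if i not in drop]
    return kept, removed


if __name__ == "__main__":
    toy = [("count", "2")] * 8 + [("count", "5")] * 2
    kept, removed = iterative_bias_pruning(toy)
    print(Counter(kept), len(removed))
```

Rescoring inside the loop matters in this sketch because removing samples changes the answer distribution, so a sample that looked predictable in an early round may no longer be predictable later.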

Paper Structure

This paper contains 33 sections, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: The Evolving Landscape of Visual Understanding Benchmarks. As benchmarks evolved from controlled, narrow tasks to open-ended VQA, they gained expressivity but became vulnerable to non-visual shortcuts. Language-driven evaluation enables flexible querying but risks models exploiting linguistic patterns rather than visual understanding.
  • Figure 2: Knowledge-based shortcuts in multimodal benchmarks. Blind vs. vision-enabled performance across LLaVA-OneVision model scales. MMMU shows substantial gains from scaling the LLM backbone (x-axis) but minimal improvement from enabling vision (y-axis), indicating reliance on linguistic knowledge. VSI-Bench demonstrates the opposite pattern—large vision gains with negligible blind scaling—confirming robustness to knowledge-based shortcuts. VideoMME shows roughly equal gains from both sources, while CV-Bench benefits more from vision but still exhibits significant gains from LLM scaling.
  • Figure 3: Statistical biases create non-visual shortcuts across diverse multimodal benchmarks. (a) Counting tasks exhibit severe long-tailed answer distribution skews; (b) Spatial relation tasks show imbalanced answer frequencies, where certain object categories appear as correct answers disproportionately often; (c) Appearance order tasks have strong category-position correlations; and (d) Size estimation tasks follow predictable log-normal distributions. Such patterns enable achieving high accuracy without the visual input.
  • Figure 4: Test-set Stress-Test (TsT) targets intrinsic test-set vulnerabilities. (a) TsT directly probes biases intrinsic to the specific test set (pink region), rather than approximating them via external training data. (b) The test set is split into $k$ folds; a blind diagnostic model trains on $k{-}1$ folds and evaluates on the held-out fold, repeated $k$ times to yield (i) overall non-visual solvability and (ii) per-sample bias scores $s(x)$.
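
The $k$-fold procedure described in the Figure 4 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the blind diagnostic model here is a per-question-type majority-answer predictor rather than a fine-tuned LLM or a Random Forest, each sample is reduced to a hypothetical `(question_type, answer)` pair, and the per-sample score is simply 1.0 when the held-out sample is answered correctly without any visual input. The name `tst_kfold` and its parameters are likewise made up for this example.

```python
import random
from collections import Counter, defaultdict


def tst_kfold(samples, k=5, seed=0):
    """Blind k-fold stress test on (question_type, answer) pairs.

    Returns (non_visual_accuracy, scores), where scores[i] is 1.0 if sample i
    was predicted correctly while held out, else 0.0. The majority-answer
    predictor is an assumed stand-in for the paper's diagnostic models."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    scores = [0.0] * len(samples)
    for held_out in folds:
        held = set(held_out)
        # "Train" on the other k-1 folds: count answers per question type.
        by_type = defaultdict(Counter)
        overall = Counter()
        for i in idx:
            if i not in held:
                qtype, answer = samples[i]
                by_type[qtype][answer] += 1
                overall[answer] += 1
        # "Evaluate" on the held-out fold: predict the most frequent answer
        # seen for that type, falling back to the overall majority.
        for i in held_out:
            qtype, answer = samples[i]
            counts = by_type.get(qtype) or overall
            pred = counts.most_common(1)[0][0] if counts else None
            scores[i] = 1.0 if pred == answer else 0.0
    return sum(scores) / len(samples), scores


if __name__ == "__main__":
    toy = ([("count", "2")] * 8 + [("count", "5")] * 2
           + [("direction", "left")] * 5 + [("direction", "right")] * 5)
    accuracy, per_sample = tst_kfold(toy, k=5)
    print(f"non-visual accuracy: {accuracy:.2f}")
```

Because every sample is scored only while it sits in the held-out fold, the resulting accuracy reflects patterns shared across the test set rather than memorization of individual items.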