Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional

Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra

TL;DR

This work provides a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation. It also finds that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies.

Abstract

Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of these dependencies and the interaction between them within current benchmark evaluations remain poorly characterized. In this work, we present a large-scale empirical study that uses multi-modal large language models (MLLMs) to quantify these dependencies across 23 visual question-answering benchmarks, covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes and types, with models often obtaining high performance by using each modality independently and showing limited dependence on their interaction. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.

Paper Structure

This paper contains 21 sections, 1 equation, 11 figures.

Figures (11)

  • Figure 1: Demonstration of intra-modality dependencies in multi-modal models using input permutation. (Left) The model answers a question about the layers of the Earth even when the image is replaced by an unrelated diagram of a brain. (Right) The model identifies a symbiotic relationship from the image even when the question is unrelated. These examples highlight a failure of multi-modal reasoning, where models exploit uni-modal priors together with the answer options to obtain an associated answer.
  • Figure 2: Radar plot comparing an ensemble of standard MLLMs with image-only, text-only, and random performance, using the recipe from the paper's recipe section. The dashed line indicates human performance, which is shown only partially because human data is unavailable for the other benchmarks.
  • Figure 3: Effect of Model Scaling on Modality Contribution. Performance of various models (8B, 13B, 34B, and a majority-vote ensemble) on four datasets selected for their specific dependencies: GQA (text), SEED (image), and POPE (inter-modality). The bars represent standard accuracy and attributed contributions from text, image, and random (bars are in the same order).
  • Figure 4: Effect of model type on modality contribution. Performance comparison between LLaVA-NeXT (May 2024), Cambrian-1 8B (June 2024), Qwen2.5-VL (April 2025), and Qwen3-VL (October 2025) on four datasets selected for their specific dependencies: GQA (text), MMBench (image), POPE (inter-modality), and MMMU-Pro (both image and text). The bars represent standard accuracy and attributed contributions from text, image, and random (bars are in the same order).
  • Figure 5: Analysis of sub-categories across datasets showing dependency on individual modalities. Although benchmarks may be designed for inter-modality reasoning, we show a strong dependence on text for categories such as relative location in ADE and COCO, higher-grade questions in ScienceQA, and multiple categories in MMMU and MMMU-Pro. This highlights how aggregate metrics can obscure the fact that many instances may not require multi-modal reasoning. We show standard accuracy in yellow and contributions from text in blue, image in green, and random in orange.
  • ...and 6 more figures
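
The input-permutation idea illustrated in Figure 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's recipe, which is not reproduced in this summary. The names `cyclic_derangement` and `accuracy`, the example dictionary keys, and the `model(image, question, options)` callable are all hypothetical. The sketch estimates text-only accuracy by pairing each question with another example's image, and image-only accuracy by pairing each image with another example's question, while the answer options stay with the original example.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical example layout: keys "image", "question", "options", "answer".
Example = Dict[str, object]
# Hypothetical model interface: returns one of the options as its prediction.
Model = Callable[[object, str, Sequence[str]], str]


def cyclic_derangement(n: int, rng: random.Random) -> List[int]:
    """Return source indices src with src[i] != i for every i (needs n >= 2)."""
    if n < 2:
        raise ValueError("need at least two examples to permute inputs")
    order = list(range(n))
    rng.shuffle(order)
    src = [0] * n
    # Each index takes its input from the next index in a random cycle,
    # so no example ever keeps its own image or question.
    for pos, idx in enumerate(order):
        src[idx] = order[(pos + 1) % n]
    return src


def accuracy(model: Model, examples: Sequence[Example],
             mode: str = "standard", seed: int = 0) -> float:
    """Accuracy under one of three input conditions.

    standard   : original image and question
    text_only  : question kept, image swapped in from another example
    image_only : image kept, question swapped in from another example
    """
    if mode not in ("standard", "text_only", "image_only"):
        raise ValueError(f"unknown mode: {mode}")
    n = len(examples)
    rng = random.Random(seed)
    src = list(range(n)) if mode == "standard" else cyclic_derangement(n, rng)
    correct = 0
    for i, ex in enumerate(examples):
        image, question = ex["image"], ex["question"]
        if mode == "text_only":
            image = examples[src[i]]["image"]
        elif mode == "image_only":
            question = examples[src[i]]["question"]
        # Options and ground-truth answer always come from the original example.
        correct += model(image, question, ex["options"]) == ex["answer"]
    return correct / n
```

Comparing the three accuracies against random-guess accuracy gives a rough picture of how much a benchmark can be solved from one modality alone. A model whose text-only accuracy is close to its standard accuracy is effectively ignoring the image on that benchmark.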