Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional

Divyam Madaan; Varshan Muhunthan; Kyunghyun Cho; Sumit Chopra

TL;DR

This work provides a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation. It finds that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies.

Abstract

Understanding the interplay between intra-modality dependencies (the contribution of an individual modality to a target task) and inter-modality dependencies (the relationships between modalities and the target task) is fundamental to advancing multi-modal learning. However, the nature of these dependencies and the interaction between them within current benchmark evaluations remain poorly characterized. In this work, we present a large-scale empirical study to quantify these dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs), covering domains such as general and expert knowledge reasoning, optical character recognition, and document understanding. Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes and types, with models often obtaining high performance by using each modality independently and showing limited dependence on their interaction. We provide a quantitative characterization of multi-modal datasets, enabling a principled approach to multi-modal benchmark design and evaluation.
