DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

Xuan Qi, Luxi He, Dan Roth, Xingyu Fu

Abstract

Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DataProphet, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DataProphet produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall's tau of 86.0%. Moreover, DataProphet enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
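
To make the ranking-agreement claim concrete, the following is a minimal sketch of how Kendall's tau compares a predicted dataset ranking with a ranking by measured gains. It is not the authors' evaluation code: the function name `kendall_tau`, the tau-a variant (which assumes no tied scores), and all numbers are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists.

    Assumes no tied scores; tied pairs count as neither concordant
    nor discordant, which tau-b would handle differently.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # both lists order the pair the same way
        elif sign < 0:
            discordant += 1  # the lists disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values, one entry per candidate training dataset.
predicted_score = [0.9, 0.4, 0.7, 0.1]   # training-free metric (made up)
measured_gain = [12.0, 3.5, 2.0, 0.5]    # post-training gain (made up)

tau = kendall_tau(predicted_score, measured_gain)
print(f"Kendall's tau: {tau:.3f}")
```

In this toy case five of the six dataset pairs are ordered the same way by both lists and one pair is swapped, so tau is (5 - 1) / 6. A value of 1 would mean the two rankings agree on every pair.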

Paper Structure (33 sections, 5 equations, 2 figures, 4 tables, 2 algorithms)

Figures (2)

  • Figure 1: Three major takeaways in DataProphet: (a) Surprisingly, human intuition about the similarity between training data and test data is a poor indicator of influence on model performance. In contrast, our DataProphet metric more reliably predicts the influence of training data on test benchmarks; (b) the impact of multimodal supervision is determined by the individual dataset rather than by its broad task category: datasets from the same task category do not necessarily help each other the most, and do not have similar influence on the same target benchmark. Here, OCR-VQA (Mishra et al., 2019) and Screen-QA (Hsiao et al., 2022) are both OCR data, ChartQA (Masry et al., 2022) and Chart2Text (Kantharaj et al., 2022) are both chart question answering benchmarks, and Open-Spatial (Cheng et al., 2024) and CLEVR spatial (Johnson et al., 2017) are both spatial reasoning benchmarks; (c) DataProphet provides an effective approach for training-free data selection under fixed compute budgets (i.e., a fixed total number of training samples). We compute the influence of each supervision dataset on a combined set of test benchmarks and select data according to the influence ratio, yielding consistent improvements across 14 tasks, with average gains of +3.4% and +6.9% under real and synthetic data settings.
  • Figure 2: Data influence analysis under fixed compute. We conduct supervised fine-tuning on the base model (InternVL 3; Zhu et al., 2025) with each training (source) dataset individually, and evaluate relative performance gains on all test (target) sets. See the cross-task influence section for details. We highlight several observations: (1) Data influence is not symmetric. (2) Data sources from the same task type do not always influence each other most. For example, OCR-VQA improves ScreenQA (OCR task) by 17.88% but improves GeomVerse (map understanding task) by 21.74%. (3) Data influence is not decided by task type, but by the individual dataset. For example, a text-rich dataset (ScreenQA) can influence a vision-centric one (Open-Spatial, with gains of 12.86%) more than a text-rich one (ChartQA, with gains of 5.61%).
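
The fixed-budget selection described in Figure 1(c) can be illustrated with a short sketch. It assumes one plausible reading of "select data according to the influence ratio", namely splitting a fixed sample budget across datasets in proportion to their influence scores. The function `allocate_budget`, the dataset names, and the scores are hypothetical and are not taken from the paper.

```python
def allocate_budget(influence, total_samples):
    """Split a fixed sample budget across datasets in proportion to influence.

    Negative scores are clipped to zero (an assumption of this sketch), and
    largest-remainder rounding keeps the allocation summing to the budget.
    """
    clipped = {name: max(score, 0.0) for name, score in influence.items()}
    total = sum(clipped.values())
    if total == 0:
        raise ValueError("at least one dataset needs a positive influence score")
    # Ideal (fractional) share of the budget for each dataset.
    raw = {name: total_samples * score / total for name, score in clipped.items()}
    alloc = {name: int(share) for name, share in raw.items()}
    # Hand out the samples lost to flooring, largest fractional part first.
    leftover = total_samples - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical influence scores for three candidate training datasets.
scores = {"dataset_a": 5, "dataset_b": 3, "dataset_c": 2}
plan = allocate_budget(scores, 1000)
print(plan)
```

Because the total is fixed, raising one dataset's share lowers the others, which matches the fixed-compute setting the caption describes. How the influence scores themselves are computed is outside the scope of this sketch.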