Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

Junjie Li, Ziao Wang, Jianghong Ma, Xiaofeng Zhang

TL;DR

This work tackles data efficiency in instruction tuning for large vision–language models by shifting data curation from task-driven heuristics to intrinsic capability analysis. It introduces CADC, which first discovers intrinsic capabilities $\mathcal{C}=\{c_1,\dots, c_K\}$ from gradient-based learning trajectories, then attributes training data to these capabilities via trajectory influence, and finally curates a capability-aware curriculum with balanced budgets and staged sequencing. The authors demonstrate that, with as little as 5% of the original data, CADC can match or surpass full-data performance on diverse multimodal benchmarks, with robust transfer across model scales and datasets. They identify three capabilities, structural grounding ($c_1$), perceptual recognition ($c_2$), and symbolic reasoning ($c_3$), and show that CADC improves them in a balanced manner, offering a principled framework for instruction data curation.

Abstract

Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the instruction-tuning data budget often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.
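
The three phases described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each sample has already been reduced to a fixed-length trajectory feature vector, uses plain k-means for discovery, substitutes cosine similarity to a cluster centroid for the paper's trajectory influence, and splits the budget equally across capabilities instead of using self-influence signals. The names `kmeans`, `curate`, `train_feats`, and `target_feats` are hypothetical, and the data is synthetic.

```python
import numpy as np


def kmeans(x, k, iters=50, seed=0):
    """Plain k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(0)
    return labels, centers


def curate(train_feats, target_feats, k=3, budget_frac=0.05, seed=0):
    """Return one array of training indices per capability (one stage each)."""
    eps = 1e-12
    # Phase 1, discovery: cluster target trajectory features into k groups.
    _, centers = kmeans(target_feats, k, seed=seed)
    # Phase 2, attribution: cosine similarity to each centroid is an ASSUMED
    # stand-in for trajectory influence; each sample goes to its best match.
    tn = train_feats / (np.linalg.norm(train_feats, axis=1, keepdims=True) + eps)
    cn = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + eps)
    influence = tn @ cn.T  # shape (n_train, k)
    owner = influence.argmax(axis=1)
    # Phase 3, curation: ASSUMED equal budget per capability; within each
    # pool keep the highest-scoring samples. Stages follow capability order.
    per_cap = int(budget_frac * len(train_feats)) // k
    stages = []
    for c in range(k):
        pool = np.flatnonzero(owner == c)
        ranked = pool[np.argsort(-influence[pool, c])]
        stages.append(ranked[:per_cap])
    return stages


# Synthetic stand-ins for trajectory features (not real model gradients).
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(600, 8))
target_feats = rng.normal(size=(90, 8))
stages = curate(train_feats, target_feats)
```

Each entry of `stages` holds the indices selected for one capability, so training on them in order gives a simple staged curriculum under a 5% budget. The equal split is the easiest balanced allocation to show; the paper instead derives budgets and ordering from self-influence.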

Paper Structure

This paper contains 32 sections, 21 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivation and capability analysis of CADC. Left: CADC disentangles mixed training data into groups aligned with intrinsic model capabilities and allocates them in a principled manner to support downstream tasks. Right: SmolVLM capability performance across $c_1$, $c_2$, and $c_3$, showing that CADC improves the model’s capabilities in a balanced manner.
  • Figure 2: Overview of the CADC pipeline. The framework operates in three phases: (1) Discovery identifies intrinsic capabilities by clustering gradient-based learning trajectories of target data; (2) Attribution maps training samples to these capabilities through trajectory influence analysis, forming capability-specific data pools; (3) Curation leverages self-influence signals to allocate budgets and sequence data, enabling capability-aware curricula. The three discovered capabilities—structural grounding ($c_1$), perceptual recognition ($c_2$), and symbolic reasoning ($c_3$)—serve as the foundation for balanced and interpretable data curation.
  • Figure 3: Intrinsic capabilities discovered on MMT-Bench.
  • Figure 4: Influence of instruction training data. Left: Sankey diagram plots trajectory influence $\operatorname{Inf}^{\text{Traj}}$ from the training pool $\mathcal{D}^{(k)}_{\text{train}}$ to capabilities $c_k$, with link thickness proportional to magnitude. Right: evolution of self-influence $\operatorname{Inf}^{\text{Self}}$, where lines trace trends and bars show change rates.
  • Figure 5: Proportion of the 162 subtasks assigned to each capability.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Definition A.1: Intrinsic Capability