Squeeze Out Tokens from Sample for Finer-Grained Data Governance
Weixiong Lin, Chen Ju, Haicheng Wang, Shengchao Hu, Shuai Xiao, Mengting Chen, Yuheng Jiao, Mingshuai Yao, Jinsong Lan, Qingwen Liu, Ying Chen
TL;DR
Data scale exhibits diminishing returns, motivating data governance to prune non-informative tokens. The authors propose DataJuicer, a dual-branch, token-level data governor that condenses images by retaining informative patches and enriches captions with visual evidence, then pretrains vision-language models on the refined data. They formalize the framework with $\mathbb{D}$, $\mathbb{V}=\Psi(\mathbb{D})$, and a two-branch condensation mechanism producing $I'$ and $T'$, optimized via an ITC objective on $\mathbb{V}$. Empirically, DataJuicer outperforms DataSieve and other baselines on image-text retrieval, classification, and dense visual reasoning across multiple datasets and scales, while enabling faster inference and better data efficiency.
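To make the training objective concrete, the following is a minimal sketch of a symmetric image-text contrastive loss computed over a batch of refined pairs. It assumes that "ITC" denotes the usual image-text contrastive (InfoNCE-style) objective. The function name `itc_loss`, the temperature value, and the plain-list embedding format are illustrative choices and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; the i-th image and i-th text form a positive pair.

    Hypothetical helper: the temperature and the averaging of both directions
    are common defaults, assumed here rather than specified by the source.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

With this sketch, a batch whose image and text embeddings are correctly paired yields a loss near zero, while a batch with swapped captions yields a much larger loss.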
Abstract
Widely observed data scaling laws, in which error falls off as a power of the training set size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance has been proposed to downsize datasets by pruning non-informative samples. Yet isolating the impact of a specific sample on overall model performance is challenging, because trying out all sample combinations requires vast computation. Current data governors circumvent this complexity by estimating each sample's contribution with a heuristic-derived scalar score and discarding low-value samples. Even after thorough sample sieving, the retained samples still inherently contain many undesired tokens, which leaves room for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for the least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance: it squeezes out informative tokens and strengthens image-text alignment. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve methods in image-text retrieval, classification, and dense visual reasoning.
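The two branches described above can be illustrated with a toy sketch. It assumes that per-patch saliency scores and a list of detected object classes are already available from some upstream model. The helpers `condense_patches` and `enrich_caption`, the `keep_ratio` parameter, and the caption template are all hypothetical and stand in for the paper's actual mechanisms, which the abstract does not specify.

```python
def condense_patches(patch_scores, keep_ratio=0.5):
    """Vision-branch sketch: keep the indices of the most salient patches.

    `patch_scores` holds one saliency score per image patch (assumed given).
    Returns the retained patch indices in their original spatial order.
    """
    k = max(1, int(len(patch_scores) * keep_ratio))
    ranked = sorted(
        range(len(patch_scores)), key=lambda i: patch_scores[i], reverse=True
    )
    return sorted(ranked[:k])


def enrich_caption(caption, object_classes):
    """Text-branch sketch: append object classes the caption does not mention.

    The ", with ..." template is an illustrative assumption, not the
    paper's caption-enhancement procedure.
    """
    lowered = caption.lower()
    missing = [c for c in object_classes if c.lower() not in lowered]
    if not missing:
        return caption
    return f"{caption.rstrip('. ')}, with {', '.join(missing)}."


# Usage: four patches, keep the top half, then enrich the caption.
kept = condense_patches([0.1, 0.9, 0.4, 0.8], keep_ratio=0.5)
refined = enrich_caption("A dog on the grass.", ["dog", "frisbee"])
```

Here `kept` holds the indices of the two highest-scoring patches, and `refined` extends the caption only with the class it did not already mention.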
