Squeeze Out Tokens from Sample for Finer-Grained Data Governance

Weixiong Lin, Chen Ju, Haicheng Wang, Shengchao Hu, Shuai Xiao, Mengting Chen, Yuheng Jiao, Mingshuai Yao, Jinsong Lan, Qingwen Liu, Ying Chen

TL;DR

Data scale exhibits diminishing returns, motivating data governance to prune non-informative tokens. The authors propose DataJuicer, a dual-branch, token-level data governor that condenses images by retaining informative patches and enriches captions with visual evidence, then pretrains vision-language models on the refined data. They formalize the framework with $\mathbb{D}$, $\mathbb{V}=\Psi(\mathbb{D})$, and a two-branch condensation mechanism producing $I'$ and $T'$, optimized via an ITC objective on $\mathbb{V}$. Empirically, DataJuicer outperforms DataSieve and other baselines on image-text retrieval, classification, and dense visual reasoning across multiple datasets and scales, while enabling faster inference and better data efficiency.
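The summary names an ITC (image-text contrastive) objective on $\mathbb{V}$ but does not write it out. One common form, assumed here rather than taken from the paper, is the symmetric InfoNCE loss over a batch of refined pairs $(I'_i, T'_i)$. The image encoder $f$, text encoder $g$, temperature $\tau$, and batch size $B$ are illustrative symbols that do not appear in the source:

$$
\mathcal{L}_{\mathrm{ITC}} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{B}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{B}\exp(s_{ji}/\tau)}\right],
\qquad s_{ij} = \big\langle f(I'_i),\, g(T'_j) \big\rangle .
$$

Under this reading, condensing $I$ to $I'$ shortens the patch sequence that $f$ processes, and enriching $T$ to $T'$ is intended to raise the matched similarity $s_{ii}$ relative to the mismatched terms.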

Abstract

Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets by pruning non-informative samples. Yet isolating the impact of a specific sample on overall model performance is challenging, because trying out all sample combinations requires vast computation. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores and discarding low-value samples. Even after thorough sample sieving, retained samples intrinsically contain substantial undesired tokens, underscoring the potential for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for the least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignment. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms existing DataSieve methods in image-text retrieval, classification, and dense visual reasoning.
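The two branches described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `condense_patches` and `enrich_caption`, the `keep_ratio` and `min_conf` parameters, and the template-based caption rewrite are all assumptions made for illustration. In the paper, saliency scores and object classes come from a vision foundation model and caption refinement is done by a text foundation model. Here they are passed in as plain arrays and dictionaries.

```python
import numpy as np


def condense_patches(patches, saliency, keep_ratio=0.5):
    """Vision-branch sketch: keep the highest-saliency patches.

    `saliency` is assumed to be one score per patch, supplied by some
    vision model (hypothetical input). Kept patches stay in their
    original order.
    """
    n_keep = max(1, int(round(len(patches) * keep_ratio)))
    top = np.argsort(saliency)[::-1][:n_keep]  # indices of the best scores
    keep_idx = np.sort(top)                    # restore spatial order
    return patches[keep_idx], keep_idx


def enrich_caption(caption, class_scores, min_conf=0.8):
    """Text-branch sketch: add confident class names missing from the caption.

    A simple string template stands in for the text foundation model
    (assumption); only classes scoring at least `min_conf` are added.
    """
    lowered = caption.lower()
    extra = [
        name
        for name, conf in class_scores.items()
        if conf >= min_conf and name.lower() not in lowered
    ]
    if not extra:
        return caption
    return f"{caption.rstrip('. ')}. Objects: {', '.join(extra)}."


if __name__ == "__main__":
    patches = np.arange(8).reshape(4, 2)       # 4 toy patches, 2 features each
    saliency = np.array([0.1, 0.9, 0.3, 0.7])  # hypothetical saliency scores
    kept, idx = condense_patches(patches, saliency, keep_ratio=0.5)
    print(idx.tolist())
    print(enrich_caption("A dog on the grass.",
                         {"dog": 0.95, "frisbee": 0.9, "cat": 0.2}))
```

In this toy run, half of the patches are dropped and only the confident class not already mentioned ("frisbee") is appended to the caption, which mirrors the intra-sample governance idea at a very small scale.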

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (A) DataSieve relies on multi-modal foundation models (FMs) to retain high-value samples. The difficulty of estimating sample value necessitates manually designed governance strategies, limiting DataSieve's generalizability across datasets. (B) DataJuicer employs vision FMs to retain informative image patches and text FMs to enhance captions by incorporating visual semantics. The automatic pipeline yields more accurate and generalizable sample refinement through finer-grained governance.
  • Figure 2: Pipeline Overview. Our framework employs vision and text branches to reduce undesired ingredients in real-world data. The vision foundation model (FM) discards image patches that contribute little to the overall semantics and extracts object classes. The text FM rectifies grammatical errors and refines the textual descriptions. By incorporating high-confidence class names, the text branch enhances image-text consistency. This process yields data that is both less redundant and less noisy, leading to more effective vision-language pretraining.
  • Figure 3: The highest DataSieve results are marked in pink. Data Scaling (Left). With training data scaled from CC3M to CC12M, a clear gap persists throughout training, suggesting DataJuicer’s wiser computation investment on informative tokens. Model Scaling (Middle). DataJuicer outperforms DataSieve across CLIP sizes (from S to L), showing large VLP models also learn better on juiced data. Schedule Scaling (Right). We train the same CLIP longer up to 82G sampled tokens (epochs of 12M data).
  • Figure 4: Generalization Across Data. DataJuicer performs better on both well-cleaned datasets and larger noisy datasets (larger markers). We choose one typical benchmark from each downstream task, i.e., MSCOCO for retrieval, ImageNet-1K for classification and MMStar for MLLM evaluation.
  • Figure 5: Generalization Across Architectures. For fair comparison, we train models with tokens of equal size, and then report zero-shot performances on ImageNet-1K.
  • ...and 2 more figures