TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training

Guanjie Cheng, Boyi Li, Lingyu Sun, Mengying Zhu, Yangyang Wu, Xinkui Zhao, Shuiguang Deng

TL;DR

The paper tackles the data-quality bottleneck in large-scale multimodal pre-training by addressing noisy web data rather than merely increasing data volume. It introduces TADS, a task-aware data selection framework that unifies Intrinsic Quality, Task Relevance, and Distributional Diversity through a Data Value Network and a bi-level feedback loop. The methodology combines multi-layer deduplication, a hybrid weak-supervision quality estimator, task-prototype relevance, cluster-aware diversity, and proxy-based performance signals to optimize sample selection across multiple downstream tasks. On CC12M, TADS delivers superior zero-shot performance using only 36% of the data, with average gains around 1.0% across tasks, demonstrating meaningful improvements in data efficiency and multi-task generalization for multimodal models.

Abstract

Large-scale multimodal pre-trained models like CLIP rely heavily on high-quality training data, yet raw web-crawled datasets are often noisy, misaligned, and redundant, leading to inefficient training and suboptimal generalization. Existing data selection methods are either heuristic-based, suffering from bias and limited diversity, or data-driven but task-agnostic, failing to optimize for multi-task scenarios. To address these gaps, we introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training that integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. TADS employs a comprehensive quality assessment system with unimodal and cross-modal operators, quantifies task relevance via interpretable similarity vectors, and optimizes diversity through cluster-based weighting. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance across multiple downstream tasks. Experiments on CC12M demonstrate that TADS achieves superior zero-shot performance on benchmarks like ImageNet, CIFAR-100, MS-COCO, and Flickr30K, using only 36% of the data while outperforming baselines by an average of 1.0%. This highlights that TADS significantly enhances data efficiency by curating a high-utility subset that yields a much higher performance ceiling within the same computational constraints.
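The three signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper learns a Data Value Network, whereas the fixed linear combination in `data_value`, the inverse-cluster-size weighting, the mean over tasks, and all function and parameter names below are assumptions made for illustration only. The 36% retention fraction is the one figure taken from the abstract.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def task_relevance(sample_emb, task_prototypes):
    """Similarity vector per sample: cosine similarity to each task prototype, shape (N, T)."""
    return l2_normalize(sample_emb) @ l2_normalize(task_prototypes).T

def diversity_weight(cluster_ids):
    """Assumed cluster-based weighting: samples in crowded clusters get smaller weights."""
    _, inverse, counts = np.unique(cluster_ids, return_inverse=True, return_counts=True)
    w = 1.0 / counts[inverse]
    return w / w.max()

def data_value(quality, relevance, diversity, weights):
    """Hypothetical linear stand-in for the learnable value function."""
    w_q, w_r, w_d = weights
    return w_q * quality + w_r * relevance.mean(axis=1) + w_d * diversity

def select_top_fraction(values, fraction=0.36):
    """Return the sorted indices of the highest-valued `fraction` of samples."""
    k = max(1, int(round(fraction * len(values))))
    return np.sort(np.argsort(-values, kind="stable")[:k])

# Toy usage with random data standing in for embeddings and quality scores.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 8))          # sample embeddings (assumed given)
protos = rng.normal(size=(4, 8))         # one prototype per downstream task
quality = rng.uniform(size=100)          # intrinsic quality scores in [0, 1]
clusters = rng.integers(0, 5, size=100)  # cluster assignment per sample

rel = task_relevance(emb, protos)
values = data_value(quality, rel, diversity_weight(clusters), (0.5, 0.3, 0.2))
kept = select_top_fraction(values, fraction=0.36)
```

In this sketch the weights `(0.5, 0.3, 0.2)` are arbitrary. In the framework described above, the mapping from the three signals to a value is learned and refined by proxy-model feedback rather than fixed by hand.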

Paper Structure

This paper contains 31 sections, 23 equations, 6 figures, 2 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Motivation: Quality outweighs Quantity. We compare the zero-shot performance of CLIP models trained on the full, uncurated CC12M dataset versus a compact subset selected by TADS. As illustrated, although TADS reduces the training data scale by 64%, it consistently achieves superior accuracy and recall across all 8 downstream benchmarks. This foundational experiment demonstrates that task-irrelevant noise, rather than data scarcity, is the primary bottleneck in multimodal pre-training, validating the necessity of our proposed task-aware selection framework.
  • Figure 2: Overview of the proposed TADS framework. The framework operates through a cascade of three integrated stages: (1) Multi-Layer Data Deduplication, which physically filters redundancy; (2) Multi-Dimensional Value Characterization, which encodes data utility from quality, relevance, and diversity perspectives; (3) Feedback-driven Meta-Learning, a bi-level learning mechanism that autonomously updates the selection strategy based on proxy model feedback.
  • Figure 3: Architecture of the Intrinsic Quality Evaluator. It integrates comprehensive unimodal and cross-modal operators to encode quality features. The MLP predictor is optimized via a hybrid learning framework that synergizes Snorkel-based weak supervision with ground-truth guidance to quantify sample quality.
  • Figure 4: Detailed analysis of deduplication and optimization. (a) Deduplication Efficiency: Performance consistently improves as redundancy is removed, even as the data size decreases significantly. (b) Optimization Strategy: TADS outperforms heuristic and random search methods across all metrics, validating the effectiveness of the gradient estimation.
  • Figure 5: Zero-shot performance comparison on 4 downstream tasks across different training scales (from 16M to 128M samples seen). TADS consistently achieves higher accuracy and recall with the same computational budget compared to state-of-the-art data selection methods.
  • ...and 1 more figure
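The feedback-driven stage described in Figure 2, and the gradient estimation mentioned in Figure 4(b), can be sketched as an outer loop that nudges selection parameters toward higher proxy-model scores. The captions do not specify the estimator, so the antithetic random-perturbation estimate below is an assumed stand-in, not the paper's method. `toy_proxy_score` is a hypothetical placeholder for "train a proxy model on the selected subset and average its downstream metrics", replaced here by a synthetic function so the example runs on its own.

```python
import numpy as np

def estimate_gradient(theta, proxy_score, rng, sigma=0.1, n_pairs=8):
    """Antithetic random-perturbation estimate of d(proxy_score)/d(theta)."""
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.normal(size=theta.shape)
        # Score the selection strategy at two mirrored perturbations.
        delta = proxy_score(theta + sigma * eps) - proxy_score(theta - sigma * eps)
        grad += delta / (2.0 * sigma) * eps
    return grad / n_pairs

def feedback_loop(theta, proxy_score, steps=50, lr=0.1, seed=0):
    """Outer loop: ascend the estimated gradient of the proxy feedback signal."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        theta = theta + lr * estimate_gradient(theta, proxy_score, rng)
    return theta

# Synthetic stand-in for proxy feedback: the score peaks at an arbitrary
# "ideal" weighting of (quality, relevance, diversity). Purely illustrative.
target = np.array([0.5, 0.3, 0.2])

def toy_proxy_score(theta):
    return -float(np.sum((theta - target) ** 2))

theta = feedback_loop(np.zeros(3), toy_proxy_score)
```

In a real pipeline each call to the score function would be expensive, since it involves training and evaluating a proxy model, so the number of perturbation pairs and outer steps would have to be kept small.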