Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning

Yiping Wang, Yifang Chen, Wendan Yan, Kevin Jamieson, Simon Shaolei Du

TL;DR

Variance Alignment Score (VAS) is a data-distribution-aware criterion for selecting informative image-text pairs for CLIP training: it aligns the training covariance with a test-prior covariance $\bar{\Sigma}_{\text{test}}$. The method filters in two stages, first removing low-quality data with a CLIP-based score and then selecting the samples that maximize the total VAS. The test prior is often computed from ImageNet-1k or, in the dynamic variant (VAS-D), from the training data; a vision-only variant is favored when text embeddings are noisy. The authors give a theoretical interpretation under linear-model assumptions and derive a generalization bound in which the VAS term dominates as the data size grows. Empirically, they report gains of about $1.3\%$ on DataComp and $2.2\%$ on CC12M across 38 tasks, with ablations showing that visual embeddings work better than text embeddings for computing VAS. The work shows that alignment between training and test covariances can guide data selection more effectively than purely sample-quality metrics, which matters in practice for noisy web-curated data.
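
The two-stage strategy described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the keep fractions, the synthetic embeddings, and the use of an uncentered second moment as the prior covariance are all hypothetical choices made for the example. With a fixed prior and a fixed budget, maximizing the total VAS reduces to keeping the samples with the largest per-sample scores, because the total is a sum of per-sample terms.

```python
import numpy as np

def second_moment(embeddings: np.ndarray) -> np.ndarray:
    """Uncentered (d x d) second moment of row-wise embeddings.
    Assumption: this stands in for the test-prior covariance."""
    return embeddings.T @ embeddings / embeddings.shape[0]

def vas_scores(embeddings: np.ndarray, sigma_prior: np.ndarray) -> np.ndarray:
    """Per-sample score <sigma_prior, f_i f_i^T> = f_i^T sigma_prior f_i."""
    return np.einsum("nd,de,ne->n", embeddings, sigma_prior, embeddings)

def two_stage_select(image_emb, clip_scores, sigma_prior,
                     clip_keep_frac, vas_keep_frac):
    """Indices kept after CLIP-score filtering, then top-VAS selection.
    Both keep fractions are hypothetical knobs for this sketch."""
    n = image_emb.shape[0]
    n_clip = max(1, int(round(clip_keep_frac * n)))
    # Stage 1: keep the samples with the highest CLIP scores.
    stage1 = np.argsort(-clip_scores)[:n_clip]
    # Stage 2: among the survivors, keep the samples with the highest VAS.
    scores = vas_scores(image_emb[stage1], sigma_prior)
    n_vas = max(1, int(round(vas_keep_frac * n_clip)))
    return np.sort(stage1[np.argsort(-scores)[:n_vas]])

def unit_rows(x: np.ndarray) -> np.ndarray:
    """Normalize each row to unit length, as is common for CLIP embeddings."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Synthetic stand-ins for real image embeddings and CLIP scores.
rng = np.random.default_rng(0)
pool = unit_rows(rng.normal(size=(1000, 16)))       # candidate training pool
reference = unit_rows(rng.normal(size=(200, 16)))   # e.g. a test-prior dataset
clip_scores = rng.uniform(size=1000)

sigma = second_moment(reference)
kept = two_stage_select(pool, clip_scores, sigma,
                        clip_keep_frac=0.5, vas_keep_frac=0.6)
print(kept.shape)
```

In practice the embeddings would come from a pre-trained CLIP image encoder, and the two keep fractions would be tuned for the dataset at hand.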

Abstract

In recent years, data selection has emerged as a core issue for large-scale visual-language model pretraining, especially on noisy web-curated datasets. One widely adopted strategy assigns a quality score, such as CLIP similarity, to each sample and retains the data pairs with the highest scores. However, these approaches are agnostic to the data distribution and often fail to select the most informative samples. To solve this problem, we propose a simple yet theoretically principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i\rangle$. Here, $\Sigma_{\text{test}}$ represents the target (cross-)covariance matrix we aim to align with, potentially based on prior knowledge, while $\Sigma_i$ denotes the tensor product of single- or multi-modal representations for the $i$-th sample. We further design a new data selection method that maximizes the total VAS. We provide theoretical analysis in a simplified setting to demonstrate the advantage of VAS over random selection and other existing data selection methods. Experimentally, applying VAS and CLIP scores together outperforms baselines by an average margin of $1.3\%$ on 38 evaluation sets for the noisy DataComp dataset and by $2.5\%$ on VTAB for the high-quality CC12M dataset. Additionally, our ablation study shows that visual features are better than text features for calculating VAS, and that related classical experimental design methods may fail in this context.
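
To make the form of the score concrete, the following short derivation expands it. It assumes that $\langle\cdot,\cdot\rangle$ is the Frobenius (trace) inner product and writes $f_i$ for the embedding of the $i$-th sample, with $f_i^{v}$ and $f_i^{l}$ for its image and text embeddings; this notation is introduced here for illustration and may differ from the paper's.

$$
\begin{aligned}
\text{single modality, } \Sigma_i = f_i f_i^\top:&\quad
\langle \Sigma_{\text{test}}, \Sigma_i \rangle
= \operatorname{tr}\!\left(\Sigma_{\text{test}}^\top f_i f_i^\top\right)
= f_i^\top \Sigma_{\text{test}} f_i, \\
\text{cross-modal, } \Sigma_i = f_i^{v} (f_i^{l})^\top:&\quad
\langle \Sigma_{\text{test}}, \Sigma_i \rangle
= \operatorname{tr}\!\left(\Sigma_{\text{test}}^\top f_i^{v} (f_i^{l})^\top\right)
= (f_i^{v})^\top \Sigma_{\text{test}} f_i^{l}, \\
\text{total over a subset } S:&\quad
\sum_{i \in S} \langle \Sigma_{\text{test}}, \Sigma_i \rangle
= \Big\langle \Sigma_{\text{test}}, \sum_{i \in S} \Sigma_i \Big\rangle .
\end{aligned}
$$

The first line uses the symmetry of $\Sigma_{\text{test}}$ in the single-modality case. The last line follows from linearity of the inner product and shows that, under these assumptions, maximizing the total VAS aligns the summed (cross-)covariance of the selected subset with the target matrix.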

Paper Structure

This paper contains 42 sections, 7 theorems, 39 equations, 4 figures, 5 tables, 2 algorithms.

Key Result

Lemma 5.1

Suppose the hindsight-best subset has at least $\underline{n}$ samples. Then, with probability at least $1-\frac{1}{|S|d}$, we have

Figures (4)

  • Figure 1: Illustration of our VAS on DataComp. $\bar{f}$ denotes the embeddings computed by the pre-trained model, and $\Sigma_{\text{prior}}$ is the test-prior covariance matrix, which can be chosen as the covariance matrix of the image embeddings from ImageNet-1k or from DataComp itself. (a) Visualization of data with different Variance Alignment Scores (VAS) and CLIP scores in DataComp. The CLIP score does not efficiently evaluate the informativeness of image-text pairs: data from OCR tasks or unconventional visual tasks (Type 4) can have high quality but little useful visual information. In contrast, VAS can select the data with more meaningful image features (Type 1 and Type 2). (b) A rough comparison of the data sampled by different filtering methods. Using VAS $\cap$ CLIP score filtering balances informativeness and quality and increases the proportion of Type 2 data, which are the most helpful data for training image-text models. Please refer to Appendix \ref{sec: add_vis} for more visualization results.
  • Figure 2: Data distribution on VAS and CLIP score. We randomly sample 5000 points from DataComp and show their corresponding VAS and CLIP scores. Here, VAS is calculated using the image embeddings from ImageNet-1k.
  • Figure 3: Visualization of image data with different CLIP scores and VAS in DataComp.
  • Figure 4: Visualization of data pairs with different CLIP scores and VAS in DataComp. Here 'img1k_vas' means VAS(ImageNet-1k) and 'self_vas' denotes VAS(DataComp). For most of the data, VAS(DataComp) is similar to VAS(ImageNet-1k).
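
The captions above mention two choices of prior: one built from an external dataset (VAS(ImageNet-1k)) and one built from the candidate pool itself (VAS(DataComp)). The sketch below shows how both scores could be computed and compared for the same pool. It is an illustration only: the helper names, the synthetic embeddings, and the uncentered second moment used as the prior are assumptions, and the printed correlation for random data says nothing about the agreement reported for real embeddings.

```python
import numpy as np

def unit_rows(x: np.ndarray) -> np.ndarray:
    """Normalize each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def vas(embeddings: np.ndarray, prior_embeddings: np.ndarray) -> np.ndarray:
    """Score each row f of `embeddings` as f^T Sigma f, where Sigma is the
    uncentered second moment of `prior_embeddings` (an assumed prior)."""
    sigma = prior_embeddings.T @ prior_embeddings / len(prior_embeddings)
    return np.einsum("nd,de,ne->n", embeddings, sigma, embeddings)

rng = np.random.default_rng(0)
pool_emb = unit_rows(rng.normal(size=(500, 8)))  # stand-in for pool image embeddings
ref_emb = unit_rows(rng.normal(size=(300, 8)))   # stand-in for an external prior set

img1k_vas = vas(pool_emb, ref_emb)    # external prior
self_vas = vas(pool_emb, pool_emb)    # prior computed from the pool itself

# Pearson correlation between the two scorings of the same pool.
corr = np.corrcoef(img1k_vas, self_vas)[0, 1]
print(f"correlation between external-prior and self-prior VAS: {corr:.3f}")
```

Using the pool itself as the prior removes the need for an external reference dataset, at the cost of aligning to whatever distribution the pool already has.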

Theorems & Definitions (13)

  • Lemma 5.1: Intuition behind VAS
  • proof : Proof sketch
  • Theorem 5.2: Main
  • Lemma 1.1
  • proof
  • Lemma 1.2
  • proof
  • Lemma 1.3: Intuition behind VAS
  • proof
  • Theorem 1.4: Main
  • ...and 3 more