CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection

Xinlin Zhuang, Yichen Li, Xiwei Liu, Haolin Yang, Yifan Lu, Ziyun Zou, Yulong Li, Huifa Li, Dongliang Chen, Qinglei Wang, Weiyang Liu, Ying Qian, Jiangming Shi, Imran Razzak

TL;DR

CHIPS reframes CLIP adaptation as a data-selection problem and introduces a curvature-aware selector that operates in CLIP's end-point subspace. It integrates a Newton-style proxy alignment, a JL-sketched InfoNCE curvature estimator, and learnability plus domain-relevance weights. The method selects high-utility image-text pairs for continual pre-training (CPT). It achieves state-of-the-art results among selection baselines on 17 medical benchmarks and matches full-dataset CPT with only 30% of the data. On 31 general-domain benchmarks, it yields the smallest performance drop under 10-30% data-retention budgets, and ablations confirm the benefit of combining the alignment, margin, and relevance components. The work highlights that strategic, CLIP-aware data selection can substantially reduce data requirements for domain adaptation, with practical implications for efficient, targeted multimodal pre-training.

Abstract

Adapting CLIP to vertical domains is typically approached by novel fine-tuning strategies or by continual pre-training (CPT) on large domain-specific datasets. Yet, data itself remains an underexplored factor in this process. We revisit this task from a data-centric perspective: Can effective data selection substitute for large-scale datasets in CPT? We introduce CHIPS (Curvature-aware Hybrid Influence in Projection Subspace), which assigns each image-text pair a utility score that integrates three complementary factors aligned with three goals: faithfulness via a curvature-aware, Newton-style alignment computed in CLIP's end-point subspace; scalability via an InfoNCE-aware curvature estimator with Johnson-Lindenstrauss (JL) sketching; and retention via a selection-aware relevance weight combined with learnability to balance target adaptation against general-domain preservation. We justify this design theoretically by proving a lower-bound guarantee on the proxy's correlation with full-parameter alignment and by characterizing the bias-variance trade-offs introduced by curvature mixing and JL sketching. We evaluate CHIPS empirically across various settings: 1) CHIPS attains state-of-the-art performance among selection baselines on 17 medical benchmarks, matches full-dataset CPT with 30% of the data, and outperforms half-dataset CPT using only 10%; 2) on 31 general-domain benchmarks, CHIPS yields the smallest performance drop under 10-30% data-retention budgets. Code, data, and checkpoints will be released.
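To illustrate the scalability ingredient, the following is a minimal sketch of Johnson-Lindenstrauss sketching in general, not the paper's estimator. It assumes a dense Gaussian projection, and the function name `jl_sketch`, the dimensions, and the random stand-in "gradients" are all hypothetical. It shows that inner products between high-dimensional vectors are approximately preserved after projection to a much smaller dimension.

```python
import numpy as np

def jl_sketch(vectors, sketch_dim, seed=0):
    """Project rows of `vectors` (n, d) down to (n, sketch_dim) with a Gaussian JL map."""
    rng = np.random.default_rng(seed)
    d = vectors.shape[1]
    # Entries are drawn from N(0, 1/sketch_dim), so E[<Rx, Ry>] = <x, y>.
    projection = rng.normal(0.0, 1.0 / np.sqrt(sketch_dim), size=(d, sketch_dim))
    return vectors @ projection

# Hypothetical stand-ins for per-sample gradient vectors (assumption: d = 2048).
rng = np.random.default_rng(1)
grads = rng.normal(size=(4, 2048))

sketched = jl_sketch(grads, sketch_dim=512)

# Compare pairwise inner products before and after sketching.
exact = grads @ grads.T
approx = sketched @ sketched.T
rel_err = np.abs(approx - exact).max() / np.abs(exact).max()
print(f"max relative error of sketched inner products: {rel_err:.3f}")
```

The error shrinks roughly as the inverse square root of `sketch_dim`, which is the variance side of the trade-off the abstract attributes to JL sketching.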

Paper Structure

This paper contains 80 sections, 2 theorems, 60 equations, 9 figures, 35 tables, 1 algorithm.

Key Result

Theorem 1

If $\zeta(z)$ is uncorrelated with $\mathbf g_\vartheta(z)$, then

Figures (9)

  • Figure 1: Workflow of CHIPS. For each training sample, CHIPS computes a curvature-aware proxy Newton alignment in CLIP’s end-point space (projection heads and temperature), where curvature is approximated by mixing self and negative-pair cross moments from symmetric InfoNCE and scaled efficiently via JL sketching. The alignment is then modulated by learnability and target-domain relevance to yield a single selection utility, and the top-$n$ samples are chosen for continual pre-training (CPT) of CLIP models for domain adaptation.
  • Figure 2: Downstream results of MetaCLIP-B16-400M continually pre-trained on 10% of the data selected by CHIPS under: (A) different evaluation set sizes, (B) different values of the mixing coefficient $\alpha$ used in computing alignment scores, and (C) different values of the balance coefficient $\beta$ used in computing relevance scores.
  • Figure 3: Medical downstream results of MetaCLIP-B16-400M continually pre-trained on 10% of the data selected by different selection methods under various end-point geometry settings.
  • Figure 4: Medical downstream results of MetaCLIP-B16-400M continually pre-trained on 10% of the data selected by CHIPS under various random projection settings.
  • Figure 5: Distribution of CLIPScore on BIOMEDICA.
  • ...and 4 more figures
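
The workflow in Figure 1 can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the alignment is written in the generic damped-Newton form $g_i^\top (H + \lambda I)^{-1} g_{\text{target}}$, the modulation by learnability and relevance is assumed to be a simple product, and every name and number below (`newton_alignment`, `select_top_n`, the toy gradients and weights) is hypothetical.

```python
import numpy as np

def newton_alignment(sample_grads, target_grad, curvature, damping=1e-3):
    """Score each sample by g_i^T (H + damping * I)^{-1} g_target (generic Newton-style form)."""
    k = curvature.shape[0]
    # Solve the damped linear system once instead of forming an explicit inverse.
    direction = np.linalg.solve(curvature + damping * np.eye(k), target_grad)
    return sample_grads @ direction

def select_top_n(alignment, learnability, relevance, n):
    """Combine the three factors (assumed multiplicative) and return the n best indices."""
    utility = alignment * learnability * relevance
    # argpartition finds the n largest utilities without a full sort.
    top = np.argpartition(-utility, n - 1)[:n]
    # Order the selected indices from highest to lowest utility.
    return top[np.argsort(-utility[top])], utility

# Toy 2-dimensional "end-point" gradients for four samples (assumption).
sample_grads = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]])
target_grad = np.array([1.0, 0.0])    # stand-in for the target-domain gradient
curvature = np.diag([2.0, 1.0])       # stand-in for the estimated curvature
learnability = np.array([0.5, 1.0, 1.0, 0.8])
relevance = np.ones(4)

alignment = newton_alignment(sample_grads, target_grad, curvature)
selected, utility = select_top_n(alignment, learnability, relevance, n=2)
print("selected indices:", selected.tolist())
```

In this toy setup, samples whose gradients point along the curvature-corrected target direction receive positive alignment, and the weights then rescale that alignment before the top-$n$ cut.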

Theorems & Definitions (2)

  • Theorem 1: Proxy–full alignment correlation
  • Theorem 2: Error bound for curvature mixing