Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data

Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu, Tianyang Hu, Xiaodan Liang, Zhenguo Li, Hong Cheng, Kenji Kawaguchi

TL;DR

This paper addresses the inefficiency of improving CLIP-style vision-language models by introducing HELIP, a cost-effective, data-recycling approach that exploits hard text-image pairs within existing data through Hard Pair Mining (HPM) and a Hard Negative Margin Loss (HNML). By identifying pair-level hard instances and enforcing a margin-aware geometry in the joint space, HELIP enables continuous training that yields substantial gains in zero-shot classification, image-text retrieval, and linear probing without collecting new data. The method demonstrates consistent improvements across multiple pre-trained models and datasets, including ImageNet, CIFAR, and fine-grained benchmarks, and proves effective even when scaling data or using subset mining (FastHPM). Overall, HELIP offers a practical path to boost large multimodal models with minimal resource overhead, highlighting the value of maximizing information from existing training data.

Abstract

Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise resource and time costs, limiting practical use. In this work, we introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training. This eliminates the need for additional data or extensive retraining. Moreover, HELIP integrates into current training pipelines with minimal code modifications, allowing for quick implementation. On comprehensive benchmarks, HELIP consistently boosts existing models. In particular, within just two epochs of training, it improves zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M, and YFCC15M datasets by 3.05%, 4.47%, and 10.1%, respectively. In addition, on fine-grained classification datasets, HELIP improves the zero-shot performance of CLIP and SLIP by an average of 8.4% and 18.6%, respectively, and their linear probe performance by an average of 9.5% and 3.0%, respectively. The code is publicly available at: https://github.com/haonan3/HELIP-NACCL-2025.git.

Paper Structure

This paper contains 33 sections, 7 equations, 10 figures, 10 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Hard Pair Mining (HPM). Choose hard pairs by optimizing the support set to maximize the agreement prediction of the target pair.
  • Figure 2: Hard Negative Margin Loss (HNML). Hard negative pairs (e.g., the golden retriever) are closer to the positive than normal negative pairs are.
  • Figure 3: Continuous training of CLIP with hard pairs. For text-image pairs within a batch, we sample corresponding hard data from the preprocessed hard pair set.
  • Figure 4: Zero-shot performance on ImageNet for models pre-trained on different dataset sizes.
  • Figure 5: Hard pairs from HPM and FastHPM. FastHPM produces high-quality hard pairs that compete with those from HPM.
  • ...and 5 more figures
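
The geometry described in the Figure 2 caption can be illustrated with a small hinge-style penalty. This is only a sketch of one possible reading of that caption, not the paper's HNML definition: the function name `hard_negative_margin_penalty`, the use of cosine similarity, the choice of the least-similar hard negative as the reference point, and the `margin` parameter are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_margin_penalty(anchor, hard_negs, normal_negs, margin=0.0):
    """Hypothetical hinge penalty (an assumption, not the paper's formula).

    It is zero when every normal negative is at least `margin` less similar
    to the anchor than the least-similar hard negative, and grows linearly
    with each violation.
    """
    if not hard_negs or not normal_negs:
        return 0.0
    # Reference point: the hard negative that is least similar to the anchor.
    weakest_hard = min(cosine(anchor, h) for h in hard_negs)
    total = 0.0
    for n in normal_negs:
        # Penalize a normal negative that sits closer than allowed.
        total += max(0.0, cosine(anchor, n) - weakest_hard + margin)
    return total / len(normal_negs)


# Toy 2-D embeddings: the hard negative is nearer to the anchor.
anchor = [1.0, 0.0]
hard = [[1.0, 1.0]]          # cosine with anchor is about 0.707
far_normal = [[0.0, 1.0]]    # cosine with anchor is 0.0
near_normal = [[1.0, 0.1]]   # cosine with anchor is about 0.995

ordered = hard_negative_margin_penalty(anchor, hard, far_normal)
violated = hard_negative_margin_penalty(anchor, hard, near_normal)
```

In this toy setup `ordered` is zero because the normal negative is already farther from the anchor than the hard negative, while `violated` is positive because the ordering is reversed. In a training loop, a term of this kind would presumably be added to the usual contrastive objective; how the paper weights and combines the two is not stated in this excerpt.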