A Survey on Data Curation for Visual Contrastive Learning: Why Crafting Effective Positive and Negative Pairs Matters

Shasvat Desai, Debasmita Ghose, Deep Chakraborty

TL;DR

Addressing the data curation bottleneck in visual contrastive learning, this paper provides a comprehensive taxonomy of positive and negative pair creation strategies. It systematically organizes methods into single-instance versus multi-instance positives and into hard, false, and synthetic negatives, with subcategories including embedding-based, synthetic, supervised, attribute-based, and cross-modal approaches. The analysis highlights key trade-offs between diversity and semantic relevance, as well as computational considerations, and discusses open questions for handling emerging modalities. The resulting framework offers practical guidance for designing informative, efficient contrastive representations with better downstream generalization.

Abstract

Visual contrastive learning aims to learn representations by contrasting similar (positive) and dissimilar (negative) pairs of data samples. The design of these pairs significantly impacts representation quality, training efficiency, and computational cost. A well-curated set of pairs leads to stronger representations and faster convergence. As contrastive pre-training sees wider adoption for solving downstream tasks, data curation becomes essential for optimizing its effectiveness. In this survey, we attempt to create a taxonomy of existing techniques for positive and negative pair curation in contrastive learning, and describe them in detail.

Paper Structure

This paper contains 22 sections, 3 figures.

Figures (3)

  • Figure 1: Taxonomy for crafting positive and negative pairs.
  • Figure 2: Positive Pair Curation Techniques: Positive pair selection can use single-instance or multi-instance techniques. (a) Single-instance curation applies augmentations to a single sample. Multi-instance positive pair generation, in contrast, falls into several categories. (b) Embedding-based: retrieves the top-K nearest neighbors of the anchor sample's augmentation in the embedding space and pairs them with other augmentations of the anchor. (c) Synthetic: generates data conditioned on the input, which is then augmented and paired with the augmented real sample. (d) Supervised: uses external sources (human labels, oracles, or annotations) to fetch another sample from the same category and create positive pairs. (e) Attribute-based: groups samples by shared attributes (e.g., golden retrievers paired with golden labrador retrievers based on fur color) and pairs their respective augmentations. (f) Cross-modal: creates semantically aligned pairs across multiple modalities. The figure shows image-text and speech-image pairing.
  • Figure 3: Negative Pair Curation Techniques: This figure shows three categories of techniques for negative pair curation. (a) Hard Negative Selection prioritizes negatives that are semantically similar to the anchor sample, such as a different cat breed, instead of an unrelated category like an airplane. The negatives are then augmented and fed into the encoder. (b) False Negative Elimination removes or reclassifies negatives that are highly similar to the anchor sample, preventing the model from mistakenly separating highly similar samples. The remaining negatives are then augmented before encoding. Hard negatives improve discrimination but risk overfitting, while false negative elimination reduces noise but may mistakenly remove challenging yet valid negatives, weakening the representations. (c) Synthetic negative pairs are created by feeding the positive and negative samples (dataset) into a generative process conditioned on the anchor sample to create realistic but distinct negatives. The generated samples then undergo augmentation and are fed, together with the positive pairs, to the downstream encoder.
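
The embedding-based positive selection in Figure 2(b) and the hard negative selection and false negative elimination in Figure 3(a) and 3(b) all rank candidates by similarity to the anchor. The sketch below is a minimal illustration of that shared idea, not the method of any surveyed paper. It assumes embeddings are already computed, uses cosine similarity, omits the augmentation step, and detects false negatives with a fixed similarity threshold. The function name curate_pairs and the parameters k_pos, n_hard, and false_neg_threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def curate_pairs(anchor, candidates, k_pos=1, n_hard=2, false_neg_threshold=0.95):
    """Split candidate embeddings into roles relative to the anchor.

    Returns (positives, hard_negatives, dropped) as lists of candidate indices.
    All parameters are illustrative assumptions, not values from the survey.
    """
    sims = [cosine(anchor, c) for c in candidates]
    # Rank candidates from most to least similar to the anchor.
    ranked = sorted(range(len(candidates)), key=lambda i: sims[i], reverse=True)

    # Fig. 2(b): the top-K nearest neighbours serve as extra positives.
    positives = ranked[:k_pos]
    rest = ranked[k_pos:]

    # Fig. 3(b): remaining candidates that are still very close to the anchor
    # are treated as likely false negatives and removed from the negative set.
    dropped = [i for i in rest if sims[i] >= false_neg_threshold]

    # Fig. 3(a): among the remaining negatives, keep the most similar ones.
    remaining = [i for i in rest if sims[i] < false_neg_threshold]
    hard_negatives = remaining[:n_hard]

    return positives, hard_negatives, dropped


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    candidates = [
        [0.99, 0.05],  # nearest neighbour of the anchor
        [0.98, 0.10],  # very similar, above the false-negative threshold
        [0.70, 0.70],  # moderately similar
        [0.00, 1.00],  # orthogonal
        [-1.0, 0.00],  # opposite direction
    ]
    print(curate_pairs(anchor, candidates))
```

The threshold makes the trade-off noted in the Figure 3 caption concrete: lowering false_neg_threshold removes more noise but also discards candidates that would otherwise have been the hardest valid negatives.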