$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs

Vlad Sobal, Mark Ibrahim, Randall Balestriero, Vivien Cabannes, Diane Bouchacourt, Pietro Astolfi, Kyunghyun Cho, Yann LeCun

TL;DR

$\mathbb{X}$-Sample Contrastive reframes contrastive learning as learning over a soft cross-sample similarity graph. It addresses the binary positive/negative limitation of standard contrastive losses by encoding cross-sample relations in a soft adjacency matrix built from captions or class descriptions. The proposed $L_{\mathbb{X}\text{-CLR}}$ objective, which uses soft targets derived from a text encoder, yields stronger and more data-efficient representations on ImageNet-scale data and large caption collections, including a 0.6% improvement over CLIP on both ImageNet and ImageNet Real when training on CC12M, and notable gains in background disambiguation. The approach is robust to data quality, although label quality plays a critical role in fine-grained attribute disambiguation, and it stays computationally economical because similarities can be precomputed offline. Together, these results suggest that enriching contrastive objectives with cross-sample semantic signals can produce richer, more generalizable foundation-model representations, and that the objective can be used to fine-tune pretrained backbones with limited overhead.

Abstract

Learning good representations involves capturing the diverse ways in which data samples relate. Contrastive loss, an objective matching related samples, underlies methods from self-supervised to multimodal learning. Contrastive losses, however, can be viewed more broadly as modifying a similarity graph to indicate how samples should relate in the embedding space. This view reveals a shortcoming in contrastive learning: the similarity graph is binary, as only one sample is the related positive sample. Crucially, similarities across samples are ignored. Based on this observation, we revise the standard contrastive loss to explicitly encode how a sample relates to others. We experiment with this new objective, called $\mathbb{X}$-Sample Contrastive, to train vision models based on similarities in class or text caption descriptions. Our study spans three scales: ImageNet-1k with 1 million, CC3M with 3 million, and CC12M with 12 million samples. The representations learned via our objective outperform both contrastive self-supervised and vision-language models trained on the same data across a range of tasks. When training on CC12M, we outperform CLIP by $0.6\%$ on both ImageNet and ImageNet Real. Our objective appears to work particularly well in lower-data regimes, with gains over CLIP of $16.8\%$ on ImageNet and $18.1\%$ on ImageNet Real when training with CC3M. Finally, our objective seems to encourage the model to learn representations that separate objects from their attributes and backgrounds, with gains of $3.3$-$5.6\%$ over CLIP on ImageNet9. We hope the proposed solution takes a small step towards developing richer learning objectives for understanding sample relations in foundation models.
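The idea of replacing binary positives with soft, caption-derived targets can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names, the single-view setup, the choice to keep self-similarity in the softmax, and the default temperatures `tau` and `tau_s` are all assumptions made for illustration. It computes a cross-entropy between a softmax over caption-caption similarities (the soft graph) and a softmax over image-image similarities.

```python
import math

def cosine_matrix(vectors):
    """Pairwise cosine similarities between all rows."""
    unit = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def log_softmax(row, temperature):
    """Numerically stable log-softmax of row / temperature."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]

def soft_targets(caption_emb, tau_s):
    """Soft adjacency: each row is a distribution over related samples."""
    return [[math.exp(v) for v in log_softmax(row, tau_s)]
            for row in cosine_matrix(caption_emb)]

def soft_contrastive_loss(image_emb, caption_emb, tau=0.1, tau_s=0.1):
    """Mean cross-entropy between caption-derived soft targets and the
    softmax over image-image similarities (tau, tau_s are assumed values)."""
    targets = soft_targets(caption_emb, tau_s)
    log_preds = [log_softmax(row, tau) for row in cosine_matrix(image_emb)]
    per_sample = [-sum(t * lp for t, lp in zip(t_row, lp_row))
                  for t_row, lp_row in zip(targets, log_preds)]
    return sum(per_sample) / len(per_sample)

# Toy embeddings (made up): samples 0 and 1 have similar captions, sample 2 differs.
captions = [[1.0, 0.1], [0.9, 0.3], [0.0, 1.0]]
aligned_images = [[1.0, 0.1], [0.9, 0.3], [0.0, 1.0]]     # same geometry as captions
misaligned_images = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.3]]  # samples 1 and 2 swapped

aligned_loss = soft_contrastive_loss(aligned_images, captions)
misaligned_loss = soft_contrastive_loss(misaligned_images, captions)
print(aligned_loss, misaligned_loss)
```

In this toy setup the loss is lower when the image embeddings reproduce the similarity structure of the captions than when two samples are swapped, which is the behavior a soft-graph objective is meant to reward.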

Paper Structure

This paper contains 36 sections, 1 theorem, 11 equations, 9 figures, 10 tables.

Key Result

Theorem 1

VICReg [bardes2021vicreg], SimCLR [chen2020simple], and BarlowTwins [zbontar2021barlow] losses can be expressed in terms of the graph ${\bm{G}}$ (eq:G_ssl), where $\tilde{{\bm{z}}} \triangleq {\bm{z}} / \left\| {\bm{z}} \right\|$ and $\tilde{{\bm{Z}}}$ is the column-normalized ${\bm{Z}}$, so that each column has unit norm.

Figures (9)

  • Figure 1: (a) The diagram of $\mathbb{X}$-CLR. The $\mathbb{X}$-CLR objective learns representations of images with the help of a soft relationship graph. The graph can be built from accompanying data, e.g. a taxonomy for biological data. In our experiments, we use captioned images and build the graph from the similarity of captions. (b) Python-style pseudo-code of $\mathbb{X}$-CLR with similarity based on text captions.
  • Figure 2: Sample similarity adjacency matrices of existing methods vs. our $\mathbb{X}$-Sample Contrastive similarity loss (right). We show pairwise similarities of 20 samples belonging to 4 classes. A similarity of 1 means the samples are identical; 0 means they are completely unrelated. In self-supervised learning, none of the inter-sample relationships are modelled (left). Supervised learning relies on the labels to group samples of the same class together (center). $\mathbb{X}$-CLR models inter-class relationships by associating cats with dogs and pianos with guitars.
  • Figure 3: (a) $\mathbb{X}$-Sample Contrastive Loss is data efficient with ImageNet pretraining. We outperform SimCLR in low-data regimes and match Supervised Contrastive trained on ground-truth labels at varying levels of data scarcity. (b) KNN performance on ImageNet. $\mathbb{X}$-CLR outperforms other methods with KNN probing for a range of values of K. (c) Sensitivity of $\mathbb{X}$-Sample Contrastive to temperature. We test the performance of our method when trained with different values of temperature $\tau_s$ on ImageNet data.
  • Figure 4: Visualizing pairwise similarities. The SupCon [khosla2020supervised] objective does not encourage non-zero similarity between samples of different classes (left), while $\mathbb{X}$-CLR target similarities take into account semantic closeness within categories such as dogs or types of balls (center). On the right, we see that the trained model successfully learns the soft similarity. For more graphs, see \ref{fig:sims_all}.
  • Figure 5: Target and learned similarities for different graphs.
  • ...and 4 more figures
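
The three kinds of adjacency matrix contrasted in the Figure 2 caption can be sketched as follows. The class list mirrors the caption's cat/dog and piano/guitar example, but the numeric similarity values in `CLASS_SIM` and all function names are hypothetical, chosen by hand purely for illustration rather than taken from the paper.

```python
CLASSES = ["cat", "dog", "piano", "guitar"]

# Hypothetical class-to-class similarities (illustrative values, not from the paper).
CLASS_SIM = {("cat", "dog"): 0.6, ("piano", "guitar"): 0.5}

def class_similarity(a, b):
    """Symmetric lookup: 1.0 for the same class, 0.0 if no relation is listed."""
    if a == b:
        return 1.0
    return CLASS_SIM.get((a, b), CLASS_SIM.get((b, a), 0.0))

def self_supervised_graph(labels):
    """Each sample is related only to itself; labels are ignored."""
    n = len(labels)
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def supervised_graph(labels):
    """Samples are related only when they share a class label."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]

def soft_graph(labels):
    """Samples of related classes receive intermediate similarity."""
    return [[class_similarity(a, b) for b in labels] for a in labels]

# Two samples per class, ordered cat, cat, dog, dog, piano, piano, guitar, guitar.
labels = [c for c in CLASSES for _ in range(2)]

for name, build in [("self-supervised", self_supervised_graph),
                    ("supervised", supervised_graph),
                    ("soft", soft_graph)]:
    print(name)
    for row in build(labels):
        print(" ".join(f"{v:.1f}" for v in row))
```

Printing the three matrices shows the progression the caption describes: a pure diagonal, then same-class blocks, then blocks with non-zero off-block entries linking related classes.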

Theorems & Definitions (1)

  • Theorem 1: [cabannes2023active]