Supervised Contrastive Learning with Hard Negative Samples

Ruijie Jiang, Thuan Nguyen, Prakash Ishwar, Shuchin Aeron

TL;DR

The paper addresses improving contrastive learning by combining supervised, label-aware sampling with hard-negative tilting, formulating hard-SCL (H-SCL). It shows that, in the limit of infinitely many negatives, the H-SCL loss is bounded above by the hard-UCL loss, $\mathcal{L}^{(\infty)}_{\text{H-SCL}} \leq \mathcal{L}^{(\infty)}_{\text{H-UCL}}$, under a key assumption. Empirically, H-SCL consistently outperforms UCL, H-UCL, and SCL on image and graph benchmarks, with H-SCL(β) often providing the strongest gains and Assumption 1 validated across datasets. The work provides both practical guidance for applying hard-negative sampling in supervised settings and theoretical insight into why hard negatives improve representation learning, while noting avenues for relaxing the assumptions and extending to non-asymptotic regimes.

Abstract

Through minimization of an appropriate loss function such as the InfoNCE loss, contrastive learning (CL) learns a useful representation function by pulling positive samples close to each other while pushing negative samples far apart in the embedding space. The positive samples are typically created using "label-preserving" augmentations, i.e., domain-specific transformations of a given datum or anchor. In the absence of class information, in unsupervised CL (UCL), the negative samples are typically chosen randomly and independently of the anchor from a preset negative sampling distribution over the entire dataset. This leads to class collisions in UCL. Supervised CL (SCL) avoids this class collision by conditioning the negative sampling distribution on samples having labels different from that of the anchor. In hard-UCL (H-UCL), which has been shown to be an effective method to further enhance UCL, the negative sampling distribution is conditionally tilted, by means of a hardening function, towards samples that are closer to the anchor. Motivated by this, in this paper we propose hard-SCL (H-SCL), wherein the class-conditional negative sampling distribution is tilted via a hardening function. Our simulation results confirm the utility of H-SCL over SCL with significant performance gains in downstream classification tasks. Analytically, we show that in the limit of infinitely many negative samples per anchor and under a suitable assumption, the H-SCL loss is upper bounded by the H-UCL loss, thereby justifying the utility of H-UCL for controlling the H-SCL loss in the absence of label information. Through experiments on several datasets, we verify the assumption as well as the claimed inequality between H-UCL and H-SCL losses. We also provide a plausible scenario where the H-SCL loss is lower bounded by the UCL loss, indicating the limited utility of UCL in controlling the H-SCL loss.
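
The four negative sampling schemes contrasted in the abstract can be sketched as reweightings of one candidate pool. The following is a minimal illustration, not the authors' implementation: the function name `negative_sampling_weights`, the exponential hardening function `exp(beta * g)`, the uniform base distribution, and the toy similarities and labels are all assumptions made for this example.

```python
import math

def negative_sampling_weights(sims, labels, anchor_label, beta=0.0, supervised=False):
    """Return a normalized negative-sampling distribution over candidates.

    sims[i]    : similarity g(anchor, candidate_i); the scale is an assumption
    labels[i]  : class label of candidate_i
    beta       : tilt strength of the assumed hardening function exp(beta * g);
                 beta = 0 leaves the (uniform) base distribution untilted
    supervised : if True, candidates sharing the anchor's label get zero weight
    """
    weights = []
    for g, y in zip(sims, labels):
        if supervised and y == anchor_label:
            weights.append(0.0)                 # label-aware: never a negative
        else:
            weights.append(math.exp(beta * g))  # tilt towards similar samples
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no admissible negatives for this anchor")
    return [w / total for w in weights]

# Toy pool for a "cat" anchor; candidate 0 is a same-class (false negative) sample.
sims = [0.9, 0.8, 0.1, -0.5]
labels = ["cat", "dog", "dog", "car"]

ucl = negative_sampling_weights(sims, labels, "cat")                               # UCL
scl = negative_sampling_weights(sims, labels, "cat", supervised=True)              # SCL
hucl = negative_sampling_weights(sims, labels, "cat", beta=2.0)                    # H-UCL
hscl = negative_sampling_weights(sims, labels, "cat", beta=2.0, supervised=True)   # H-SCL

for name, w in [("UCL", ucl), ("SCL", scl), ("H-UCL", hucl), ("H-SCL", hscl)]:
    print(name, [round(x, 3) for x in w])
```

In this toy pool, the H-UCL weights concentrate on the same-class candidate (a class collision), whereas the H-SCL weights exclude it and concentrate on the most similar candidate from a different class.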

Paper Structure (14 sections, 3 theorems, 33 equations, 5 figures, 2 tables)

This paper contains 14 sections, 3 theorems, 33 equations, 5 figures, 2 tables.

Key Result

Proposition 1

Let $p$ be a probability distribution over $\mathcal{Z}$ and $\rho: \mathcal{Z} \rightarrow [0,\infty)$ a nonnegative function such that $\alpha := \mathbb{E}_{z\sim p}[\rho(z)] \in (0,\infty)$. Then, $q(z) := \rho(z)\,p(z)/\alpha$ is also a probability distribution and for any measurable function $s: \mathcal{Z} \rightarrow \mathbb{R}$ we have $\mathbb{E}_{z\sim q}[s(z)] = \mathbb{E}_{z\sim p}[s(z)\,\rho(z)]/\alpha$.
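
A small numerical check can make the tilting in Proposition 1 concrete. The sketch below assumes a finite $\mathcal{Z}$ and the standard tilted form $q(z) = \rho(z)\,p(z)/\alpha$; the symbol `q` and the values of `p`, `rho`, and `s` are made up for illustration and are not taken from the paper.

```python
import math

# Base distribution p over a four-point space Z = {0, 1, 2, 3} (illustrative values).
p = [0.1, 0.2, 0.3, 0.4]
# Nonnegative tilting function rho; rho may vanish on part of Z.
rho = [0.0, 1.0, 2.0, 1.0]
# An arbitrary real-valued function s on Z.
s = [3.0, -1.0, 0.5, 2.0]

# Normalizer alpha = E_{z ~ p}[rho(z)], required to lie in (0, infinity).
alpha = sum(pi * ri for pi, ri in zip(p, rho))
assert 0.0 < alpha < math.inf

# Assumed tilted distribution q(z) = rho(z) * p(z) / alpha.
q = [ri * pi / alpha for pi, ri in zip(p, rho)]

# Expectation of s under q, computed directly ...
lhs = sum(qi * si for qi, si in zip(q, s))
# ... and as a rho-weighted expectation under p, divided by alpha.
rhs = sum(pi * si * ri for pi, si, ri in zip(p, s, rho)) / alpha

print("sum(q) =", sum(q))
print("E_q[s] =", lhs, " E_p[s * rho] / alpha =", rhs)
```

The two expectations agree up to floating-point error, which is the change-of-measure step that lets an expectation under a tilted negative sampling distribution be rewritten as a reweighted expectation under the base distribution.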

Figures (5)

  • Figure 1: Top-1 accuracy (in %) of UCL, H-UCL, SCL, and H-SCL on the CIFAR100 dataset.
  • Figure 2: Schematic illustration of negative sampling strategies under the SCL, H-UCL, and H-SCL settings when classifying species of cat. Top row (SCL): the negative samples (red rings) are randomly sampled from the set of circle samples, which belong to classes different from the anchor's (yellow triangle). Middle row (H-UCL): the negative samples (red rings) are selected only from the neighbors of the anchor (yellow triangle). Since H-UCL prefers samples that are close to the anchor, it may select false negative samples (green triangles), which come from the same class as the anchor. Bottom row (H-SCL): the negative samples (red rings) are selected such that they are not only "true negative" samples (circle samples) but are also close to the anchor (yellow triangle).
  • Figure 3: For a given representation function $f$, anchor $x$ (yellow triangle) and a threshold $\tau$, $\mathcal{H}_{\textup{\tiny H-UCL}}(x,f,\tau)$ contains all the samples $x^-$ that satisfy the constraint $e^{g(x,x^-)} \geq \tau$ (samples within the solid-line circle in the figure), i.e., samples that are difficult to distinguish from the anchor in the representation space. $\mathcal{H}_{\textup{\tiny HSCL}}(x,f,\tau)$ is a subset of $\mathcal{H}_{\textup{\tiny H-UCL}}(x,f,\tau)$ and only contains samples that are hard to distinguish from the anchor and have labels different from the anchor's (blue discs within the solid-line circle). $\mathcal{H}_{\textup{\tiny Hcol}}(x,f,\tau)$ only contains samples that are hard to distinguish from the anchor and have the same label as the anchor (triangles within the solid-line circle). The set $\mathcal{H}_{\textup{\tiny SCL}}(x)$ consists of all samples having labels different from the anchor's, irrespective of whether they are easy or hard to distinguish from the anchor (all blue discs).
  • Figure 4: Fraction of anchors satisfying Assumption 1 at the end of each epoch for $\tau = e^{-0.5}$ (first panel), $\tau = e^{-0.1}$ (second panel), $\beta = 1$ (third panel) and $\beta = 2$ (fourth panel).
  • Figure 5: Comparison of four different loss functions across epochs, with $\tau = e^{-0.5}$ (first panel), $\tau = e^{-0.1}$ (second panel), $\beta = 1$ (third panel) and $\beta = 2$ (fourth panel).
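
The four sample sets described in the Figure 3 caption can be computed directly from similarity scores and labels. The following is a minimal sketch under stated assumptions: the helper name `hard_sets`, the use of precomputed scores `sims[i]` standing in for $g(x, x^-_i)$, and the toy data are hypothetical and only illustrate the set relations in the caption.

```python
import math

def hard_sets(sims, labels, anchor_label, tau):
    """Index sets from the Figure 3 caption for one anchor.

    sims[i]   : assumed precomputed score g(x, x_i^-) for candidate i
    labels[i] : class label of candidate i
    tau       : hardness threshold applied to exp(g)
    """
    idx = range(len(sims))
    # H_H-UCL: candidates that are hard to distinguish from the anchor.
    h_hucl = {i for i in idx if math.exp(sims[i]) >= tau}
    # H_SCL: all candidates with a label different from the anchor's.
    h_scl = {i for i in idx if labels[i] != anchor_label}
    # H_H-SCL: hard candidates with a different label (true hard negatives).
    h_hscl = h_hucl & h_scl
    # H_col: hard candidates sharing the anchor's label (class collisions).
    h_col = h_hucl - h_scl
    return h_hucl, h_hscl, h_col, h_scl

# Toy pool for a "cat" anchor with threshold tau = exp(-0.5), i.e. g >= -0.5.
sims = [0.9, 0.2, -0.3, -0.8, -0.6]
labels = ["cat", "dog", "cat", "dog", "cat"]
h_hucl, h_hscl, h_col, h_scl = hard_sets(sims, labels, "cat", math.exp(-0.5))

print("H_H-UCL:", sorted(h_hucl))
print("H_H-SCL:", sorted(h_hscl))
print("H_col:  ", sorted(h_col))
print("H_SCL:  ", sorted(h_scl))
```

By construction, the hard set splits into the label-aware hard negatives and the collisions, matching the caption's statement that the H-SCL set is a subset of the H-UCL set.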

Theorems & Definitions (7)

  • Proposition 1
  • proof
  • Definition 1: Hardening function
  • Proposition 2
  • proof
  • Lemma 1
  • proof