Temperature-Free Loss Function for Contrastive Learning

Bum Jun Kim, Sang Woo Kim

TL;DR

This paper tackles the sensitivity and tuning burden of temperature scaling in InfoNCE-based contrastive learning. It introduces a temperature-free loss that maps each cosine similarity through the log-odds function, equivalently $2\,\operatorname{artanh}(\cos \theta)$, and feeds the result into the softmax in place of the temperature-scaled similarity. The authors' theoretical analysis shows that division by temperature can cause gradient problems, whereas the proposed log-odds mapping keeps gradients nonzero away from the optimum and drives them to zero only at the optimum. Empirically, the method matches or surpasses temperature-based baselines on five benchmarks spanning image classification, graph representation, anomaly detection, NLP, and sequential recommendation, with the added benefit of hyperparameter-free deployment.
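The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the single-anchor list interface, and the `eps` clamp (added here only to keep `atanh` finite when a cosine similarity reaches exactly ±1) are all assumptions. The `temperature` argument is included solely to contrast the conventional scaled form with the temperature-free one.

```python
import math


def log_odds_logit(cos_sim, eps=1e-6):
    """Map a cosine similarity in [-1, 1] to 2 * artanh(cos_sim).

    Uses the identity log((1 + c) / (1 - c)) = 2 * artanh(c).
    The eps clamp is an assumption of this sketch, not part of the source.
    """
    c = min(max(cos_sim, -1.0 + eps), 1.0 - eps)
    return 2.0 * math.atanh(c)


def info_nce(cos_sims, positive_index, temperature=None):
    """InfoNCE loss for one anchor.

    cos_sims: cosine similarities between the anchor and each candidate.
    positive_index: position of the positive pair in cos_sims.
    temperature: if given, use the conventional cos / temperature logits;
                 if None, use the temperature-free log-odds logits.
    """
    if temperature is None:
        logits = [log_odds_logit(c) for c in cos_sims]
    else:
        logits = [c / temperature for c in cos_sims]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[positive_index]


if __name__ == "__main__":
    sims = [0.5, -0.5]  # one positive, one negative
    print("temperature-free:", info_nce(sims, positive_index=0))
    print("tau = 0.5:       ", info_nce(sims, positive_index=0, temperature=0.5))
```

Only the construction of the logits changes; the softmax and cross-entropy steps are the same in both branches, which is why the temperature-free form can be dropped into an existing InfoNCE implementation.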

Abstract

As one of the most promising methods in self-supervised learning, contrastive learning has achieved a series of breakthroughs across numerous fields. A predominant approach to implementing contrastive learning is applying InfoNCE loss: by capturing the similarities between pairs, InfoNCE loss enables learning the representation of data. Despite its success, adopting InfoNCE loss requires tuning a temperature, which is a core hyperparameter for calibrating similarity scores. Although several studies have emphasized its significance and the sensitivity of performance to it, searching for a valid temperature requires extensive trial-and-error experiments, which increases the difficulty of adopting InfoNCE loss. To address this difficulty, we propose a novel method to deploy InfoNCE loss without temperature. Specifically, we replace temperature scaling with the inverse hyperbolic tangent function, resulting in a modified InfoNCE loss. In addition to hyperparameter-free deployment, we observed that the proposed method even yielded a performance gain in contrastive learning. Our detailed theoretical analysis reveals that the current practice of temperature scaling in InfoNCE loss causes serious problems in gradient descent, whereas our method provides desirable gradient properties. The proposed method was validated on five contrastive learning benchmarks, yielding satisfactory results without temperature tuning.

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of contrastive learning with InfoNCE loss under the existing division-by-temperature scheme and under the proposed method. The cosine similarities $[C, -C]$ follow the scenario in the main text. When approaching the optimal point $C \rightarrow 1$, the division-by-temperature scheme raises several problems in gradient descent, whereas the proposed method ensures convergence. The graphs on the right represent the loss $L_1$.
  • Figure 2: Gradient scale corresponding to Eq. \ref{eq:grad}. Higher temperatures yield a nonzero gradient scale near the optimal point $C \rightarrow 1$, whereas lower temperatures yield vanishing gradients at nonoptimal points.
  • Figure 3: Gradient scale corresponding to Eq. \ref{eq:grad}. Lower $C$ should exhibit a nonzero gradient scale, but $C \rightarrow 1$ should yield a zero gradient scale; these conditions can only be satisfied for a precisely chosen temperature, such as $\tau=0.25$.
  • Figure 4: Gradient scale corresponding to Eq. \ref{eq:multi} for $\tau=0.25$. Although $N=2$ provides a near-zero gradient when approaching the optimal point $C \rightarrow 1$, other conditions such as $N=16$ yield a nonzero gradient scale, thereby affecting the valid temperature.
  • Figure 5: Gradient scale corresponding to Eq. \ref{eq:proposed}. The proposed method ensures the zero gradient when approaching the optimal point $C \rightarrow 1$ and nonzero gradient scales on other points with a monotonically decreasing function.
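
The qualitative behavior described in the captions above can be reproduced numerically for the two-logit scenario $[C, -C]$ named in Figure 1. The closed forms below are derived for this sketch and are not copied from the paper's equations, which are not shown here and may differ in form. With logits $[C/\tau, -C/\tau]$ the loss is $\log(1 + e^{-2C/\tau})$, giving a gradient scale of $(2/\tau)/(1 + e^{2C/\tau})$; with logits $\pm 2\,\operatorname{artanh}(C)$ the loss is $\log\!\left(1 + \left(\tfrac{1-C}{1+C}\right)^2\right)$, giving $2(1-C)/\big((1+C)(1+C^2)\big)$. The function names are hypothetical.

```python
import math


def grad_scale_temperature(c, tau):
    """|dL/dC| for L = log(1 + exp(-2C / tau)), the two-logit case [C, -C].

    Written in a form that avoids overflow for large |2C / tau|.
    """
    x = 2.0 * c / tau
    if x >= 0:
        e = math.exp(-x)
        return (2.0 / tau) * e / (1.0 + e)
    return (2.0 / tau) / (1.0 + math.exp(x))


def grad_scale_log_odds(c):
    """|dL/dC| for L = log(1 + ((1 - C) / (1 + C))**2), valid for -1 < C <= 1."""
    return 2.0 * (1.0 - c) / ((1.0 + c) * (1.0 + c * c))


if __name__ == "__main__":
    for c in (0.0, 0.5, 0.9, 1.0):
        row = ", ".join(
            f"tau={tau}: {grad_scale_temperature(c, tau):.3e}"
            for tau in (0.05, 0.25, 1.0)
        )
        print(f"C={c}: {row}, log-odds: {grad_scale_log_odds(c):.3e}")
```

Under these assumptions, a high temperature such as $\tau = 1$ leaves a clearly nonzero gradient scale at $C = 1$, a low temperature such as $\tau = 0.05$ already gives a vanishing scale at $C = 0.5$, and the log-odds form decreases monotonically to exactly zero at $C = 1$.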

Theorems & Definitions (5)

  • Example 3.1
  • Example 3.2
  • Example 3.3
  • Example 3.4
  • Example 4.1