Towards Better Understanding of Contrastive Sentence Representation Learning: A Unified Paradigm for Gradient

Mingxin Li, Richong Zhang, Zhijie Nie

TL;DR

This work discovers that four effective contrastive losses can be integrated into a unified paradigm, which depends on three components: the **Gradient Dissipation**, the **Weight**, and the **Ratio**, and enables non-contrastive SSL to achieve outstanding performance in STS.

Abstract

Sentence Representation Learning (SRL) is a crucial task in Natural Language Processing (NLP), where contrastive Self-Supervised Learning (SSL) is currently a mainstream approach. However, the reasons behind its remarkable effectiveness remain unclear. Specifically, many studies have investigated the similarities between contrastive and non-contrastive SSL from a theoretical perspective. Such similarities can be verified in classification tasks, where the two approaches achieve comparable performance. But in ranking tasks (i.e., Semantic Textual Similarity (STS) in SRL), contrastive SSL significantly outperforms non-contrastive SSL. Therefore, two questions arise: First, *what commonalities enable various contrastive losses to achieve superior performance in STS?* Second, *how can we make non-contrastive SSL also effective in STS?* To address these questions, we start from the perspective of gradients and discover that four effective contrastive losses can be integrated into a unified paradigm, which depends on three components: the **Gradient Dissipation**, the **Weight**, and the **Ratio**. Then, we conduct an in-depth analysis of the roles these components play in optimization and experimentally demonstrate their significance for model performance. Finally, by adjusting these components, we enable non-contrastive SSL to achieve outstanding performance in STS.
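
To make the three components concrete, here is a minimal Python sketch for InfoNCE alone. It assumes a plain dot-product similarity on fixed vectors, so the paper's own formulation and notation may differ. Under that assumption, the gradient of the loss for one anchor factors into three parts: a scalar that shrinks as the positive dominates the softmax (playing the role of gradient dissipation), softmax weights over the negatives, and a constant ratio of 1. The function names and toy vectors are illustrative assumptions, not taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(h, pos, negs, tau):
    """InfoNCE loss for one anchor h, using dot-product similarity (assumption)."""
    logits = [dot(h, pos) / tau] + [dot(h, n) / tau for n in negs]
    m = max(logits)  # shift for a numerically stable log-sum-exp
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[0]

def decomposed_gradient(h, pos, negs, tau):
    """Gradient of the loss w.r.t. h, written as
    -dissipation * (ratio * pos - sum_j weights[j] * negs[j])."""
    logits = [dot(h, pos) / tau] + [dot(h, n) / tau for n in negs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    p_pos = exps[0] / sum(exps)            # softmax probability of the positive
    dissipation = (1.0 - p_pos) / tau      # tends to 0 once the positive dominates
    neg_total = sum(exps[1:])
    weights = [e / neg_total for e in exps[1:]]  # softmax over negatives only
    ratio = 1.0                            # constant for InfoNCE in this sketch
    direction = [
        ratio * pos[d] - sum(w * n[d] for w, n in zip(weights, negs))
        for d in range(len(h))
    ]
    grad = [-dissipation * x for x in direction]
    return grad, dissipation, weights, ratio

def numerical_gradient(h, pos, negs, tau, eps=1e-6):
    """Central finite differences, used only to check the decomposition."""
    grad = []
    for d in range(len(h)):
        up, down = list(h), list(h)
        up[d] += eps
        down[d] -= eps
        diff = info_nce_loss(up, pos, negs, tau) - info_nce_loss(down, pos, negs, tau)
        grad.append(diff / (2 * eps))
    return grad

# Toy unit vectors (hypothetical data): one positive and two negatives.
h = [1.0, 0.0, 0.0]
pos = [0.8, 0.6, 0.0]
negs = [[0.0, 1.0, 0.0], [0.6, 0.0, 0.8]]
tau = 0.05

grad, dissipation, weights, ratio = decomposed_gradient(h, pos, negs, tau)
numeric = numerical_gradient(h, pos, negs, tau)
print("dissipation:", dissipation)
print("weights:", weights)
print("max abs error:", max(abs(a - b) for a, b in zip(grad, numeric)))
```

In this toy setup the negative with the higher similarity to the anchor receives almost all of the weight, which is the sense in which a low temperature concentrates the update on hard negatives.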

Paper Structure

This paper contains 28 sections, 1 theorem, 45 equations, 12 figures, and 7 tables.

Key Result

Lemma 1

For an anchor $h_i$ with positive sample $h_i'$ and negative sample $h_j'$, assume the angle between the plane $Oh_ih_i'$ and the plane $Oh_ih_j'$ is $\alpha$. When $h_i$ moves along the optimization direction $\lambda(rh_i'-h_j')$, $r$ must satisfy a condition (omitted here) that ensures $h_i$ is closer to $h_i'$ after the optimization step.
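
The role of $r$ can be probed numerically. The sketch below takes one small step along $\lambda(rh_i'-h_j')$, re-normalizes, and reports whether the cosine similarity to the positive went up. Measuring closeness by cosine similarity after re-normalization is an assumption of this sketch, as are the function names and the toy angles. The threshold it computes is a first-order value derived under that assumption and is not necessarily the inequality stated in the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_gain(h, pos, neg, r, lam=1e-4):
    """Change in cos(h, pos) after one step along lam * (r * pos - neg),
    followed by re-normalization (assumed notion of 'closer')."""
    stepped = normalize([x + lam * (r * p - n) for x, p, n in zip(h, pos, neg)])
    return dot(stepped, pos) - dot(h, pos)

def first_order_threshold(h, pos, neg):
    """Value of r above which the step raises cos(h, pos) to first order.
    Uses the components of pos and neg that are tangent to the sphere at h."""
    pos_t = [p - dot(h, pos) * x for p, x in zip(pos, h)]
    neg_t = [n - dot(h, neg) * x for n, x in zip(neg, h)]
    return dot(neg_t, pos_t) / dot(pos_t, pos_t)

# Toy geometry (hypothetical): the anchor-positive angle is 0.5 rad, the
# anchor-negative angle is 1.0 rad, and the angle between the two planes is 0.6 rad.
theta_pos, theta_neg, alpha = 0.5, 1.0, 0.6
h = [1.0, 0.0, 0.0]
pos = [math.cos(theta_pos), math.sin(theta_pos), 0.0]
neg = [
    math.cos(theta_neg),
    math.sin(theta_neg) * math.cos(alpha),
    math.sin(theta_neg) * math.sin(alpha),
]

r_star = first_order_threshold(h, pos, neg)
print("threshold r*:", r_star)
for r in (1.0, r_star - 0.1, r_star + 0.1):
    print(f"r = {r:.3f}, change in cos(h, pos) = {cosine_gain(h, pos, neg, r):+.3e}")
```

With these toy angles the threshold is above 1, so a step with $r = 1$ moves the anchor away from its positive even though the positive appears in the update direction.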

Figures (12)

  • Figure 1: Average Spearman's correlation on Semantic Textual Similarity tasks for ineffective optimization objectives before ("ori") and after ("mod") modifications under different backbones.
  • Figure 2: Average values of the gradient dissipation term under different $\mu_\mathrm{pos}$-$\mu_\mathrm{neg}$ pairs for ArcCon and MET. The appendix shows the results for InfoNCE and MPT.
  • Figure 3: Variations in the average portion of the hardest negative samples in the weight across different $\mu_\mathrm{neg}$, under different temperatures $\tau$.
  • Figure 4: Average values of three dynamic ratio terms. The shaded areas mark $\mu_\mathrm{pos}$-$\mu_\mathrm{neg}$ pairs that do not occur in the actual optimization process: the lower part is excluded because of gradient dissipation, and the upper part because $\mu_\mathrm{pos} < \mu_\mathrm{neg}$ always holds.
  • Figure 5: Distribution of cosine similarity for anchor-negative pairs (left) and anchor-positive pairs (right).
  • ...and 7 more figures

Theorems & Definitions (4)

  • Conjecture 1
  • Conjecture 2
  • Lemma 1
  • Conjecture 3