Embedding And Clustering Your Data Can Improve Contrastive Pretraining

Luke Merrick

TL;DR

This work extends source-stratified contrastive pretraining by introducing semantic clustering of embeddings to form topic-like sub-sources, aiming to improve negative mining without incurring extra embedding costs. Using MSMARCO, the authors cluster either queries or passages into k = 10 groups via spherical k-means, train BERT-based encoders with CLS embeddings, and observe higher training hardness and modest retrieval gains. Empirically, clustering yields about a 2% NDCG@10 improvement on MSMARCO dev and generally positive but dataset-dependent effects across the MTEB Retrieval suite, highlighting both benefits and limitations of this approach. The paper also synthesizes connections to TAS-B and ANCE, offers a triangle-inequality perspective on why topic-aligned embeddings may improve learning, and outlines future directions for smarter clustering, data filtering, and curriculum-based training in contrastive pretraining.

Abstract

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
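
As a rough illustration of the stratification idea described above, the sketch below clusters precomputed embeddings with a simple spherical k-means and then builds minibatches that each draw from a single cluster. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `spherical_kmeans` and `cluster_stratified_batches` are hypothetical, the random vectors stand in for embeddings from a pretrained model, and the iteration count, batch size, and tail-dropping policy are illustrative choices. Only k = 10 is taken from the TL;DR.

```python
import numpy as np


def spherical_kmeans(embeddings, k, n_iter=20, seed=0):
    """Cluster rows by cosine similarity; returns (labels, unit-norm centroids)."""
    rng = np.random.default_rng(seed)
    # Normalize rows so that dot products equal cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Initialize centroids from k distinct data points (hypothetical choice).
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the most similar centroid.
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            total = x[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old centroid if the cluster is empty
                centroids[j] = total / norm
    labels = np.argmax(x @ centroids.T, axis=1)
    return labels, centroids


def cluster_stratified_batches(labels, batch_size, seed=0):
    """Return index batches in shuffled order, each drawn from one cluster."""
    rng = np.random.default_rng(seed)
    batches = []
    for j in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == j))
        # Drop the ragged tail so every batch has the same size.
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            batches.append(idx[start:start + batch_size])
    order = rng.permutation(len(batches))
    return [batches[i] for i in order]


# Stand-in data: random vectors instead of real query or passage embeddings.
demo_embeddings = np.random.default_rng(42).normal(size=(1000, 32))
demo_labels, _ = spherical_kmeans(demo_embeddings, k=10)
demo_batches = cluster_stratified_batches(demo_labels, batch_size=16)
```

Each entry of `demo_batches` indexes examples from one cluster only, so in-batch negatives come from the same topic-like sub-source as the positives.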

Paper Structure (44 sections, 1 equation, 3 figures, 2 tables)

This paper contains 44 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Figure 7 from merrick2024arctic, which plots the NDCG@10 score on the SciDocs dataset over the course of training. In this experiment, stratified training data (dark blue line) leads to better quality than a large batch size alone (purple line), and combining both approaches (light blue line) does better than either one individually. The stratified approach also improves the long-term trajectory more, suggesting that an element of curriculum learning may be at play.
  • Figure 2: Experimental training loss curves (rolling average of 10 steps over faded original values). Clustering by pseudo-sub-sources leads to substantially higher average training loss, as well as higher variance step-to-step.
  • Figure 3: The triangle inequality $|a - b| \leq c \leq a + b$ guarantees that for any pair of similar vectors, a third vector that is similar to one of them cannot be too dissimilar to the other, while a third vector that is dissimilar to one cannot be too similar to the other.
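
The bound in the Figure 3 caption can be checked numerically. The sketch below assumes angular distance between embedding vectors as the metric, which may differ from the distance the paper has in mind. The helper `angle`, the random vectors, and the assignment of a, b, and c to particular pairs are all illustrative assumptions.

```python
import numpy as np


def angle(u, v):
    """Angular distance in radians between two vectors (a metric on directions)."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


rng = np.random.default_rng(0)
# Hypothetical stand-ins for a query, its paired passage, and a third vector.
query, passage, other = rng.normal(size=(3, 64))

a = angle(query, passage)  # distance between the paired vectors
b = angle(passage, other)  # distance from the passage to the third vector
c = angle(query, other)    # distance from the query to the third vector

# Triangle inequality: |a - b| <= c <= a + b (small tolerance for rounding).
lower, upper = abs(a - b), a + b
assert lower - 1e-9 <= c <= upper + 1e-9
```

When `a` is small, meaning the pair is similar, the lower and upper bounds on `c` both sit close to `b`. This is the sense in which a third vector's distance to one member of a similar pair constrains its distance to the other.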