Embedding And Clustering Your Data Can Improve Contrastive Pretraining

Luke Merrick

TL;DR

This work extends source-stratified contrastive pretraining by clustering embeddings into topic-like sub-sources within each source, with the aim of improving negative mining without incurring extra embedding costs. Using MSMARCO, the author clusters either queries or passages into k = 10 groups via spherical k-means, trains BERT-based encoders with CLS embeddings, and observes harder training batches (higher training loss) along with modest retrieval gains. Empirically, clustering yields about a 2% NDCG@10 improvement on MSMARCO dev and generally positive but dataset-dependent effects across the MTEB Retrieval suite, which highlights both the benefits and the limitations of the approach. The paper also connects the method to TAS-B and ANCE, offers a triangle-inequality perspective on why topic-aligned embeddings may improve learning, and outlines future directions for smarter clustering, data filtering, and curriculum-based training in contrastive pretraining.

Abstract

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
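The clustering-then-batching idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `spherical_kmeans` and `cluster_stratified_batches`, the random initialization, the fixed iteration count, the dropped ragged tail, and the shuffled batch order are all assumptions made for illustration, and random vectors stand in for real query or passage embeddings.

```python
import numpy as np


def spherical_kmeans(embeddings, k=10, n_iters=20, seed=0):
    """Cluster rows by cosine similarity (hypothetical helper, not from the paper)."""
    rng = np.random.default_rng(seed)
    # Unit-normalize so that a dot product equals cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Assumption: initialize centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its most similar centroid.
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) == 0:
                continue  # keep the old centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # Re-project the mean direction onto the unit sphere.
                centroids[j] = total / norm
    # Final assignment against the last centroid update.
    return np.argmax(x @ centroids.T, axis=1)


def cluster_stratified_batches(labels, batch_size, seed=0):
    """Return index batches in which every example shares one cluster label."""
    rng = np.random.default_rng(seed)
    batches = []
    for j in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == j))
        # Assumption: drop the ragged tail so all batches have equal size.
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            batches.append(idx[start:start + batch_size])
    # Assumption: shuffle batch order so clusters are interleaved in training.
    order = rng.permutation(len(batches))
    return [batches[i] for i in order]


# Stand-in data: random vectors in place of real query or passage embeddings.
data_rng = np.random.default_rng(0)
query_embeddings = data_rng.normal(size=(1000, 32))
labels = spherical_kmeans(query_embeddings, k=10)
batches = cluster_stratified_batches(labels, batch_size=16)
```

Each entry of `batches` is an array of row indices drawn from a single cluster, so the in-batch negatives for a contrastive loss all come from the same pseudo-sub-source. In a multi-source setting, the same procedure would presumably be run separately within each source.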

Paper Structure (44 sections, 1 equation, 3 figures, 2 tables)

Introduction
Methodology
Experiments
Clustering Details
Training
Results
In Search Of Deeper Understanding
Concept One: The Clustering Hypothesis And Topicality Aware Sampling
Concept Two: The ANCE Perspective On Hard Negative Mining
A Possible Synthesis: Topic-Aligned Embeddings And A Triangle Inequality Thought Experiment
Reality Check: Does This Explanation's Assumptions Hold Up In Practice?
Limitations And Alternatives
A Need For Curriculum Learning
Why Not Just Hard Negative Mine For Pretraining?
Related Work
...and 29 more sections

Figures (3)

Figure 1: Figure 7 from merrick2024arctic, which depicts the NDCG@10 score on the SciDocs dataset during training. This experiment showed that stratified training data (dark blue line) can lead to better quality than simply using a large batch size (purple line), while combining both approaches (light blue line) does better than either alone. It also shows that the stratified approach improves the long-term trajectory more, suggesting that an element of curriculum learning may be at play.
Figure 2: Experimental training loss curves (rolling average of 10 steps over faded original values). Clustering by pseudo-sub-sources leads to substantially higher average training loss, as well as higher variance step-to-step.
Figure 3: The triangle inequality $|a - b| \leq c \leq a + b$ guarantees that for any pair of similar vectors, a third vector that is similar to one of them cannot be too dissimilar to the other, while a third vector that is dissimilar to one cannot be too similar to the other.
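The bound in the Figure 3 caption can be made concrete with a short worked example. The setup here is an assumption, not taken from the paper: embeddings are unit-normalized, distances are Euclidean, and the labels $q$ (query), $p$ (paired passage), and $n$ (a third passage) are chosen only for illustration, with $a = \lVert q - p \rVert$, $b = \lVert p - n \rVert$, and $c = \lVert q - n \rVert$.

```latex
% For unit vectors, Euclidean distance and cosine similarity are related by
\lVert u - v \rVert = \sqrt{2 - 2\cos(u, v)}
\quad\Longleftrightarrow\quad
\cos(u, v) = 1 - \tfrac{1}{2}\lVert u - v \rVert^{2}.

% Squaring the triangle inequality |a - b| \le c \le a + b (all terms nonnegative) gives
1 - \tfrac{1}{2}(a + b)^{2} \;\le\; \cos(q, n) \;\le\; 1 - \tfrac{1}{2}(a - b)^{2}.

% Example: \cos(q, p) = 0.9 and \cos(p, n) = 0.8, so
a = \sqrt{0.2} \approx 0.447, \qquad b = \sqrt{0.4} \approx 0.632,
% and therefore
0.417 \;\lesssim\; \cos(q, n) \;\lesssim\; 0.983.
```

Under these assumptions, a passage that is similar to the paired passage is guaranteed to be at least moderately similar to the query, which is one way to read the caption's claim that a third vector similar to one of a similar pair cannot be too dissimilar to the other.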
