Embedding And Clustering Your Data Can Improve Contrastive Pretraining
Luke Merrick
TL;DR
This work extends source-stratified contrastive pretraining by introducing semantic clustering of embeddings to form topic-like sub-sources, aiming to improve negative mining without incurring extra embedding costs. Using MSMARCO, the authors cluster either queries or passages into k = 10 groups via spherical k-means, train BERT-based encoders with CLS embeddings, and observe higher training hardness and modest retrieval gains. Empirically, clustering yields about a 2% NDCG@10 improvement on MSMARCO dev and generally positive but dataset-dependent effects across the MTEB Retrieval suite, highlighting both benefits and limitations of this approach. The paper also synthesizes connections to TAS-B and ANCE, offers a triangle-inequality perspective on why topic-aligned embeddings may improve learning, and outlines future directions for smarter clustering, data filtering, and curriculum-based training in contrastive pretraining.
Abstract
Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology, and we discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
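
The stratification described above can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes the query or passage embeddings have already been computed by some pretrained model and are available as a NumPy array of non-zero vectors. The function names `spherical_kmeans` and `cluster_stratified_batches`, the iteration count, the choice to drop partial batches, and the shuffling of batch order are all assumptions made for illustration.

```python
import numpy as np


def spherical_kmeans(embeddings, k=10, n_iter=20, seed=0):
    """Cluster embeddings by cosine similarity (hypothetical helper).

    Rows are L2-normalized, and each centroid is re-normalized after
    every update, which is what distinguishes spherical k-means from
    the Euclidean variant.
    """
    rng = np.random.default_rng(seed)
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Initialize centroids from k distinct data points (an assumption).
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the centroid with the highest cosine similarity.
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centroids[j] = total / norm
    labels = np.argmax(x @ centroids.T, axis=1)
    return labels, centroids


def cluster_stratified_batches(labels, batch_size, seed=0):
    """Build minibatches whose examples all share one cluster label.

    Partial batches are dropped and the batch order is shuffled; both
    choices are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    batches = []
    for j in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == j))
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            batches.append(idx[start:start + batch_size])
    order = rng.permutation(len(batches))
    return [batches[i] for i in order]
```

Each returned batch is an array of row indices into the training pairs, so the in-batch negatives seen by the contrastive loss all come from one semantic cluster. With k = 10 clusters per source, this refines single-source batching into single-cluster batching without changing the loss itself.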
