GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

Aivin V. Solatorio

TL;DR

GISTEmbed introduces a guide-model-driven method to dynamically curate in-batch negatives during contrastive fine-tuning of text embeddings, addressing the data quality and sampling biases that affect traditional unsupervised triplet mining. By computing similarities with a powerful guide model and masking potentially relevant negatives, it replaces random batch negatives with a cleaner, context-aware set, formalized as $\mathcal{L}_G$ using $G_B$. The approach shows consistent improvements on the MTEB benchmark across model sizes, with particularly strong benefits for smaller models, and it achieves notable gains in semantic textual similarity tasks and in certain classification and reranking tasks. Augmenting the training data with MTEB classification triplets and task-specific synthetic data further boosts performance, suggesting GISTEmbed’s practical potential to democratize high-quality embeddings for resource-constrained settings.

Abstract

Embedding models are integral to AI applications like semantic search, personalized recommendations, and retrieval augmented generation for LLMs, necessitating high-quality training data. However, the limited scalability of manual data curation prompts the need for automated methods to ensure data integrity. Traditional unsupervised triplet mining automates training data generation, crucial for embedding model training, yet inadvertently injects biases and noise, thereby degrading model performance. Addressing this, we introduce GISTEmbed, a novel strategy that enhances in-batch negative selection during contrastive training through a guide model. This approach departs from reliance on random sampling and equal utility assumption of batch negatives, significantly reducing noise from data quality issues and improving model fine-tuning. Benchmarked against the Massive Text Embedding Benchmark (MTEB), GISTEmbed showcases consistent performance improvements across various model sizes and achieves state-of-the-art results in select categories. This framework enables significant enhancements for smaller models by leveraging the capabilities of powerful yet resource-intensive large models. GISTEmbed can potentially revolutionize the creation of highly efficient, smaller models, democratizing access to advanced AI technologies. Making these technologies more accessible and cost-effective, especially for applications constrained by resources, significantly expands the impact and accessibility of state-of-the-art AI solutions across diverse sectors.

Paper Structure

This paper contains 28 sections, 3 equations, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: Visualization of the GISTEmbed framework for the dynamic selection of in-batch negatives for contrastive learning of embedding models. A guide model is used during training to dynamically exclude texts in the batch that are likely related to the query-positive pair being evaluated. The framework addresses potential data labeling issues and also relaxes assumptions regarding the formation of in-batch negatives that prior approaches use.
  • Figure 2: Visualization of various in-batch negatives selection strategies for contrastive learning (dashed orange boxes). Each panel contains triplets in a training batch, with the columns representing the queries, assigned positives, and assigned negatives. Panel A shows the original strategy for selecting in-batch negatives, where all the assigned negatives in the training data are considered. Panel B visualizes the selection of in-batch negatives for the bi-directional InfoNCE loss, which includes the queries as well. Panel C shows the full-sample selection of in-batch negatives. Panel D shows how GISTEmbed works, with its guide-model-informed selection of in-batch negatives. In this example, the query-positive pair (q:Capital cities, p:Ottawa) can be considered semantically related to other texts in the batch [q:"Where is Manila?", n:"What is the capital of Canada?"]. The guide model serves as a filter to remove these texts when selecting the in-batch negatives for computing the loss.
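
The guide-model filtering described in the two captions above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation. It rests on several assumptions that the text here does not confirm: the masking rule (a candidate is dropped when the guide model scores it as more similar to the query than the query's own assigned positive), the use of cosine similarity, the temperature value, and the restriction of candidates to the positives of other pairs (Panel D also draws candidates from queries and assigned negatives). The function names `guided_negative_mask` and `masked_contrastive_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def guided_negative_mask(guide_q, guide_p):
    """Return mask[i][j], True when positive j is kept as a negative for query i.

    Assumed rule: candidate j is dropped when the guide model rates it as
    more similar to query i than query i's own assigned positive.
    """
    n = len(guide_q)
    mask = []
    for i in range(n):
        threshold = cosine(guide_q[i], guide_p[i])
        row = []
        for j in range(n):
            keep = j != i and cosine(guide_q[i], guide_p[j]) <= threshold
            row.append(keep)
        mask.append(row)
    return mask


def masked_contrastive_loss(model_q, model_p, mask, temperature=0.05):
    """Mean InfoNCE-style loss that only uses negatives allowed by the mask."""
    n = len(model_q)
    total = 0.0
    for i in range(n):
        pos = cosine(model_q[i], model_p[i]) / temperature
        logits = [pos]
        for j in range(n):
            if mask[i][j]:
                logits.append(cosine(model_q[i], model_p[j]) / temperature)
        # Numerically stable log-sum-exp over the positive and kept negatives.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - pos
    return total / n


# Toy batch of three query-positive pairs (made-up 2-d vectors).
# For brevity the same vectors stand in for guide and trained-model embeddings.
queries = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
positives = [[0.8, 0.6], [1.0, 0.1], [0.0, 1.0]]

mask = guided_negative_mask(queries, positives)
loss = masked_contrastive_loss(queries, positives, mask)
print(mask)
print(loss)
```

In this toy batch, the positive of the second pair lies closer to the first query than the first query's own positive, so it is excluded from the first query's negatives, mirroring how Panel D removes related texts before the loss is computed.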