Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores

Jun Lu; David Li; Bill Ding; Yu Kang

TL;DR

This work tackles improving text embeddings when labeled data are scarce by introducing a contrastive fine-tuning framework that uses soft labels derived from expert-augmented scores. Unlike traditional hard-label fine-tuning, the method leverages $K$ expert models to compute similarities $s_k$ and derives soft targets $\hat{y}_i$ (e.g., Soft-1, Soft-2, Soft-3) to guide learning, aiming to reduce anisotropy while preserving retrieval capabilities. Evaluations on a small Q&A-derived dataset and broad MTEB retrieval benchmarks show that Soft-1 and Soft-2 typically outperform the benchmark model in nDCG@10 and mAP@10, with Soft-1 offering the best robustness and AUPRC on held-out data. The approach is cost-effective, requiring only a modest fine-tuning footprint and no additional human labeling, making it practical for real-world retrieval and RAG-style systems where labeled data are limited. Overall, the method advances practical high-quality embeddings by balancing task-specific gains with general-purpose utility, and it opens avenues for integrating more diverse expert signals and addressing anisotropy in high-dimensional embedding spaces.
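The soft-target idea can be sketched in a few lines. This is only an illustration under stated assumptions: the exact definitions of Soft-1, Soft-2, and Soft-3 are not given in this summary, so the plain mean over the $K$ expert similarities $s_k$ and the squared-error loss below are stand-ins, and the function names `soft_target` and `soft_label_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_target(expert_pairs):
    """Combine K expert similarities s_k into one soft target y_hat.

    expert_pairs holds one (query_embedding, passage_embedding) tuple per
    expert model, all for the same text pair.
    Assumption: a plain mean; the paper's Soft-1/2/3 rules may differ.
    """
    sims = [cosine(q, p) for q, p in expert_pairs]
    return sum(sims) / len(sims)


def soft_label_loss(model_sims, targets):
    """Mean squared error between the fine-tuned model's similarities and
    the soft targets. Assumption: the paper's actual loss may differ."""
    errors = [(s - y) ** 2 for s, y in zip(model_sims, targets)]
    return sum(errors) / len(errors)


# Two hypothetical experts embed the same (query, passage) pair.
expert_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # expert 1 sees the pair as identical
    ([1.0, 0.0], [0.0, 1.0]),  # expert 2 sees the pair as unrelated
]
y_hat = soft_target(expert_pairs)
loss = soft_label_loss([0.5], [y_hat])
```

Here the two experts disagree, so the target lands between 0 and 1 instead of being forced to a hard label, which is the kind of graded signal the method relies on.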

Abstract

This paper presents an approach to improving text embedding models through contrastive fine-tuning on small datasets augmented with expert scores. It focuses on semantic textual similarity tasks and text retrieval problems. The proposed method fine-tunes embedding models with soft labels derived from expert-augmented scores, preserving the models' versatility while improving their retrieval capability. The method is evaluated using a Q&A dataset from an online shopping website and eight expert models. Results show improved performance over a benchmark model across multiple metrics on various retrieval tasks from the Massive Text Embedding Benchmark (MTEB). The method is cost-effective and practical for real-world applications, especially when labeled data are scarce.
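For readers unfamiliar with the retrieval metrics named in the TL;DR (nDCG@10 and mAP@10), the following is a minimal sketch of how they can be computed for a single query. It assumes `relevances` lists the relevance of every candidate passage in ranked order, uses the linear-gain form of DCG, and normalizes average precision by the number of relevant items capped at `k`; evaluation toolkits differ on these details, so this is not necessarily the exact variant used in the paper.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def average_precision_at_k(relevances, k=10):
    """Mean of precision values at each relevant rank within the top k."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    n_relevant = min(k, sum(1 for r in relevances if r > 0))
    return total / n_relevant if n_relevant else 0.0


# Hypothetical ranking: relevant passages at ranks 1 and 3.
ranked = [1, 0, 1, 0]
ndcg = ndcg_at_k(ranked)
ap = average_precision_at_k(ranked)
```

mAP@10 is then the mean of `average_precision_at_k` over all queries in a task.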

Paper Structure (11 sections, 5 equations, 5 figures, 4 tables)

Introduction
Related Work
Proposed Method
Contrastive Fine-Tuning with Hard Label
Contrastive Fine-Tuning with Expert-Augmented Scores
Experiments
Evaluation Datasets
Results on MTEB
Results on Held-Out Set
Distributional Results
Conclusion

Figures (5)

Figure 1: Positive and negative label distributions of eight expert models for the given dataset.
Figure 2: Distributions of the IntraSample and InterSample similarities for Benchmark, Hard label, Soft-1, and Soft-2, respectively. All models exhibit a distinct separation between the IntraSample and InterSample distributions. Judged by this separation alone, the Hard label model appears to perform best, since the gap between its two distributions is the most pronounced; in terms of AUPRC, however, Soft-1 performs best (see the AUPRC table, label tb:exp_auprc).
Figure 3: PR curves of the held-out set analysis for various methods. Threshold values (see the figure labeled fig:expert_prcurive_bin) are shown only for Benchmark and Soft-1 for conciseness.
Figure 4: Diagram illustrating the inter- and intra-relationship between different queries and passages.
Figure 5: Distribution of cosine similarities between the embedding vectors of different instruction texts.
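
The caption of Figure 2 contrasts visual separation of two similarity distributions with AUPRC. A rough sketch of how an AUPRC can be computed from two such sets of scores follows. It assumes that IntraSample similarities come from matching query-passage pairs (positives) and InterSample similarities from non-matching pairs (negatives), which this summary does not define explicitly, and it uses the step-wise average-precision form of AUPRC without special handling of tied scores.

```python
def auprc(intra_sims, inter_sims):
    """Step-wise area under the precision-recall curve.

    intra_sims: similarity scores treated as positives (assumed matching pairs).
    inter_sims: similarity scores treated as negatives (assumed non-matching pairs).
    """
    scored = [(s, 1) for s in intra_sims] + [(s, 0) for s in inter_sims]
    # Rank all pairs from most to least similar.
    scored.sort(key=lambda item: item[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(scored, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / len(intra_sims)


# Hypothetical scores: one positive is ranked below a negative.
overlapping = auprc([0.9, 0.3], [0.5])
separated = auprc([0.9, 0.8], [0.2, 0.1])
```

Because AUPRC depends only on how positives and negatives are ranked relative to each other, a model with a larger gap between distribution peaks does not necessarily obtain the higher AUPRC.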
