Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores
Jun Lu, David Li, Bill Ding, Yu Kang
TL;DR
This work addresses the problem of improving text embeddings when labeled data are scarce. It introduces a contrastive fine-tuning framework that uses soft labels derived from expert-augmented scores. Instead of traditional hard-label fine-tuning, the method uses $K$ expert models to compute similarity scores $s_k$ and derives soft targets $\hat{y}_i$ from them (e.g., Soft-1, Soft-2, Soft-3) to guide learning, with the aim of reducing anisotropy while preserving retrieval capability. Evaluations on a small Q&A-derived dataset and on a broad set of MTEB retrieval benchmarks show that Soft-1 and Soft-2 typically outperform the benchmark model in nDCG@10 and mAP@10, with Soft-1 offering the best robustness and the best AUPRC on held-out data. The approach is cost-effective: it requires only a modest fine-tuning footprint and no additional human labeling, which makes it practical for real-world retrieval and RAG-style systems where labeled data are limited. Overall, the method balances task-specific gains with general-purpose utility, and it leaves room for integrating more diverse expert signals and for further addressing anisotropy in high-dimensional embedding spaces.
Abstract
This paper presents an approach to improving text embedding models through contrastive fine-tuning on small datasets augmented with expert scores. It focuses on enhancing semantic textual similarity tasks and addressing text retrieval problems. The proposed method fine-tunes embedding models with soft labels derived from expert-augmented scores, improving retrieval capability while preserving the models' versatility. The paper evaluates the method using a Q&A dataset from an online shopping website and eight expert models. Results show improved performance over a benchmark model across multiple metrics on various retrieval tasks from the Massive Text Embedding Benchmark (MTEB). The method is cost-effective and practical for real-world applications, especially when labeled data are scarce.
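
To make the idea of soft labels derived from expert scores concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the soft target for each text pair is the plain average of the $K$ expert similarity scores and that the model is trained with a squared-error objective on cosine similarity. Both choices are illustrative assumptions; the definitions of Soft-1, Soft-2, and Soft-3 are not given in this section, and all function names and data below are hypothetical.

```python
import numpy as np


def soft_labels_from_experts(expert_scores: np.ndarray) -> np.ndarray:
    """Aggregate expert similarity scores into one soft target per pair.

    expert_scores has shape (K, N): K experts, N text pairs.
    Averaging is an assumed aggregation rule, not the paper's definition.
    """
    return expert_scores.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (N, D) embedding matrices."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_norm * b_norm, axis=1)


def soft_label_loss(emb_q: np.ndarray, emb_a: np.ndarray,
                    soft_targets: np.ndarray) -> float:
    """Mean squared error between model similarities and soft targets.

    This squared-error objective is an illustrative stand-in for the
    fine-tuning loss, which this section does not specify.
    """
    preds = cosine_similarity(emb_q, emb_a)
    return float(np.mean((preds - soft_targets) ** 2))


# Toy data: K=3 hypothetical experts scoring N=2 question-answer pairs.
expert_scores = np.array([
    [0.9, 0.2],
    [0.8, 0.1],
    [0.7, 0.3],
])
y_hat = soft_labels_from_experts(expert_scores)

# Hypothetical 2-dimensional embeddings for the two pairs.
emb_q = np.array([[1.0, 0.0], [1.0, 0.0]])
emb_a = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = soft_label_loss(emb_q, emb_a, y_hat)
```

In this toy setup the targets are graded values rather than 0/1 labels, so a pair that the experts rate as moderately similar pulls the embeddings together less strongly than a pair they rate as near-identical. This is the intuition behind replacing hard labels with soft ones.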
