Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (CLP)
Jeongsu Yu
TL;DR
This paper addresses efficient fine-tuning of pre-trained text embedding models for information retrieval by combining three components: ANCE-based data selection to obtain informative hard negatives, a novel Contrastive Learning Penalty (CLP) that keeps negative samples from drifting away from their own positive queries, and a Mixture of Experts (MoE) applied to an intermediate layer to tailor embeddings to diverse input characteristics. CLP augments the conventional contrastive loss with a penalty term weighted by $\lambda$, which guides negative representations to maintain beneficial relationships with their positive queries; ANCE supplies the hard negatives, and MoE enables specialized embeddings without retraining the full model. Empirical evaluation on the MIRACL multilingual dataset (Korean, Hindi, Persian) shows nDCG gains over baselines, with Persian improving substantially when CLP and MoE are used, and the best configuration achieving an uplift of approximately 5 points. The work provides practical, end-to-end fine-tuning guidance and releases code and models for broader adoption in domain-specific, multilingual information retrieval systems.
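To make the idea of a contrastive loss plus a $\lambda$-weighted penalty concrete, the following is a minimal sketch in plain Python. It is not the paper's formulation: this summary does not give the exact form of the penalty, so the sketch assumes a standard InfoNCE-style contrastive term and a hypothetical penalty equal to the mean of one minus the cosine similarity between each negative and its own positive query. The function names, the temperature value, and the default `lam` are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05):
    """Standard InfoNCE-style contrastive loss for one query."""
    pos_logit = cosine(query, positive) / temperature
    logits = [pos_logit] + [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - pos_logit


def penalized_loss(query, positive, negatives, negative_queries,
                   lam=0.1, temperature=0.05):
    """Contrastive loss plus a lambda-weighted penalty (assumed form).

    negative_queries[j] is the query for which negatives[j] is a positive.
    The penalty below is a hypothetical choice: it grows when a negative
    moves away from its own positive query.
    """
    base = info_nce(query, positive, negatives, temperature)
    penalty = sum(
        1.0 - cosine(n, q) for n, q in zip(negatives, negative_queries)
    ) / len(negatives)
    return base + lam * penalty
```

Under this assumed form, setting `lam=0` recovers the plain contrastive loss, and the penalty vanishes when every negative is perfectly aligned with its own positive query.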
Abstract
Text embedding models play a crucial role in natural language processing, particularly in information retrieval, and their importance has grown further with the recent adoption of Retrieval-Augmented Generation (RAG). This study presents an efficient fine-tuning methodology, encompassing data selection, loss function, and model architecture, to enhance the information retrieval performance of pre-trained text embedding models. In particular, it proposes a novel Contrastive Learning Penalty function that overcomes the limitations of existing contrastive learning. The proposed methodology achieves significant performance improvements over existing methods in document retrieval tasks. This study is expected to contribute to improving the performance of information retrieval systems through fine-tuning of text embedding models. The code for this study can be found at https://github.com/CreaLabs/Enhanced-BGE-M3-with-CLP-and-MoE, and the best-performing model can be found at https://huggingface.co/CreaLabs.
