Conan-embedding: General Text Embedding with More and Better Negative Samples

Shiyu Li, Yang Tang, Shizhe Chen, Xi Chen

TL;DR

Conan-embedding tackles negative-sample bottlenecks in contrastive text embedding by introducing dynamic hard negative mining and a Cross-GPU Batch Balance Loss, supplemented with LLM prompt-response data to expand the pool of high-quality negatives. The method combines weakly-supervised pre-training with supervised fine-tuning across retrieval and STS tasks, using the $\mathcal{L}_{neg}$ and $\mathcal{L}_{cos}$ objectives together with a cross-GPU objective that stabilizes multi-task optimization. Empirically, it achieves the top average score on CMTEB ($72.62$), and ablations show clear gains in retrieval and reranking along with smoother optimization dynamics. The work provides a scalable recipe for using more diverse negatives and multi-task training to improve embedding quality, and the model is released to the community via Hugging Face.

Abstract

With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained with contrastive loss, in which negative examples are a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. First, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method that exposes the model to more challenging negative examples throughout the training process. Second, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. We therefore use a Cross-GPU Batch Balance Loss to provide more negative examples for embedding training and to balance the batch size across multiple tasks. We also find that prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of the Massive Text Embedding Benchmark (MTEB).
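
The role of negative examples in a contrastive objective can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss over cosine similarities for a single query; the function names, the toy vectors, and the temperature of 0.05 are illustrative assumptions, not values taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query (temperature is an assumed value).

    The positive sits at index 0 of the logits; every negative adds a term
    to the softmax denominator, so harder negatives raise the loss.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (hypothetical): one easy and one hard negative.
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]
hard_negative = [0.8, 0.3]

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
```

In this toy setting the easy negative contributes almost nothing to the loss, while the hard negative produces a clearly larger value, which is the intuition behind seeking more and harder negatives during training.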

Paper Structure (14 sections, 3 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The pipeline of our method includes both weakly-supervised and supervised training. During weakly-supervised training, we collect 0.75 billion data pairs and select 0.4 billion of them. During supervised training, we use a dynamic hard negative mining strategy to better fine-tune the model.
  • Figure 2: Dynamic Hard Negative Mining vs. Standard Hard Negative Mining: Score-Steps Curves. Hard negatives are checked every 100 steps. When the score multiplied by 1.15 is less than the initial score and the absolute value of the score is less than 0.8, we consider the negative example no longer difficult and replace it with a new hard negative.
  • Figure 3: An example of the Cross-GPU Batch Balance Loss. For the retrieval task, we leverage multiple GPUs to incorporate more negative examples. For the STS task, we increase the batch size to include more cases for comparison.
  • Figure 4: Comparison of Loss Curves Before and After Using the Cross-GPU Batch Balance Loss Method.
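
The replacement rule stated in the Figure 2 caption can be sketched as follows. The thresholds (a check every 100 steps, a factor of 1.15, a ceiling of 0.8) come from the caption; everything else is a hypothetical illustration, including the function names, the `(text, initial_score)` tuple layout, the `score_fn` callback, and the way replacement candidates are supplied.

```python
CHECK_INTERVAL = 100  # from the caption: negatives are checked every 100 steps
SCORE_MARGIN = 1.15   # from the caption: score * 1.15 < initial score
SCORE_CEILING = 0.8   # from the caption: |score| < 0.8


def is_stale(initial_score, current_score):
    """True when a negative no longer counts as hard under the caption's rule."""
    return (current_score * SCORE_MARGIN < initial_score
            and abs(current_score) < SCORE_CEILING)


def refresh_negatives(step, negatives, score_fn, candidate_pool):
    """Replace stale hard negatives at every check step.

    negatives: list of (text, initial_score) tuples (assumed layout).
    score_fn: hypothetical callback returning the current model score of a text.
    candidate_pool: hypothetical iterable of replacement hard negatives.
    """
    if step == 0 or step % CHECK_INTERVAL != 0:
        return list(negatives)
    fresh = iter(candidate_pool)
    refreshed = []
    for text, initial_score in negatives:
        if is_stale(initial_score, score_fn(text)):
            replacement = next(fresh, None)
            if replacement is not None:
                # The replacement's score at swap time becomes its initial score.
                refreshed.append((replacement, score_fn(replacement)))
                continue
        refreshed.append((text, initial_score))
    return refreshed


# Toy usage with made-up scores standing in for model similarities.
current_scores = {"a": 0.7, "b": 0.88, "c": 0.92}
negatives = [("a", 0.9), ("b", 0.9)]
unchanged = refresh_negatives(50, negatives, current_scores.get, ["c"])
updated = refresh_negatives(100, negatives, current_scores.get, ["c"])
```

Here negative "a" has dropped from 0.9 to 0.7, so it satisfies both conditions and is swapped for "c" at step 100, while "b" is still scored highly enough to remain in place.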