Linq-Embed-Mistral Technical Report

Chanyeol Choi, Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, Jy-yong Sohn

TL;DR

This work tackles the challenge of high-quality text retrieval by refining embedding training through data crafting, filtering, and negative mining, applied to benchmark and synthetic data. Built on E5-Mistral and Mistral-7B, Linq-Embed-Mistral achieves top-tier MTEB performance (68.2 average, 60.2 retrieval) and exemplifies how data quality and task-wise training strategies boost retrieval reliability. The approach introduces homogeneous task ordering and mixed task fine-tuning to stabilize training, plus streamlined 4-bit evaluation to accelerate validation. Together, these innovations yield state-of-the-art retrieval accuracy while maintaining practical training and evaluation efficiency for large-scale embedding systems.

Abstract

This report explores how advanced data refinement techniques can improve text retrieval performance. We develop Linq-Embed-Mistral (https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) by building on the E5-Mistral and Mistral-7B-v0.1 models, focusing on data crafting, data filtering, and negative mining methods that are tailored to each task. These methods are applied both to existing benchmark datasets and to synthetic datasets generated via large language models (LLMs). Linq-Embed-Mistral excels on the MTEB benchmark (as of May 29, 2024), achieving an average score of 68.2 across 56 datasets, and ranks 1st among all models for retrieval tasks on the MTEB leaderboard with a performance score of 60.2. This performance underscores its capability to improve search precision and reliability. Our contributions are threefold: (1) advanced data refinement methods that significantly improve model performance on benchmark and synthetic datasets; (2) techniques for homogeneous task ordering and mixed task fine-tuning that enhance model generalization and stability; and (3) a streamlined evaluation process using 4-bit precision and a light retrieval evaluation set, which accelerates validation without sacrificing accuracy.
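
The abstract names negative mining as one of the data refinement steps but does not describe the procedure. For orientation only, the following is a minimal sketch of generic similarity-based hard negative mining, not the authors' task-specific method. The function names (`cosine`, `mine_hard_negatives`), the `max_sim` cutoff for discarding likely false negatives, and the toy vectors are all assumptions introduced for illustration.

```python
import math
from typing import List, Optional, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mine_hard_negatives(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    positive_ids: Set[int],
    k: int = 2,
    max_sim: Optional[float] = None,
) -> List[int]:
    """Return indices of the k non-positive passages most similar to the query.

    max_sim is a hypothetical cutoff: candidates at or above it are skipped
    on the assumption that they may be unlabeled positives (false negatives).
    """
    scored = []
    for idx, vec in enumerate(passage_vecs):
        if idx in positive_ids:
            continue  # never use a labeled positive as a negative
        sim = cosine(query_vec, vec)
        if max_sim is not None and sim >= max_sim:
            continue  # too similar: treat as a likely false negative
        scored.append((sim, idx))
    # Highest similarity first; ties broken by lower index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy embeddings (illustrative values, not real model outputs).
query = [1.0, 0.0]
passages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
hard = mine_hard_negatives(query, passages, positive_ids={0}, k=2)
filtered = mine_hard_negatives(query, passages, positive_ids={0}, k=2, max_sim=0.99)
```

In this toy setup, `hard` selects the two closest non-positive passages, while `filtered` additionally drops the near-duplicate of the positive. In practice the embeddings would come from a retrieval model and the cutoff would need tuning per task.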

Paper Structure

This paper contains 31 sections, 2 figures, 14 tables.

Figures (2)

  • Figure 1: Overview of our proposed methods of refining the Benchmark Dataset.
  • Figure 2: Strategies for streamlining the evaluation process.