Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark

Hongliu Cao

TL;DR

This work surveys recent advances in universal text embeddings with a focus on top-performing models on the Massive Text Embedding Benchmark (MTEB). It classifies methods into data-focused, loss-focused, and LLM-focused approaches, highlighting how diverse data, improved losses, and LLM backbones drive cross-task and cross-language generalization. Key contributions include detailed analyses of data sources (e.g., CCPairs, multilingual corpora), novel losses (e.g., AnglE), and LLM-driven strategies (e.g., synthetic data, bidirectional decoding adaptations, and knowledge distillation). The review emphasizes strong gains in retrieval-related tasks while noting persistent gaps in summarization, multilingual universality, and benchmark breadth, pointing to future work on more diverse datasets, more robust similarity measures, and efficient, scalable embeddings.

Abstract

Text embedding methods have become increasingly popular in both industry and academia because of their critical role in a variety of natural language processing tasks. The significance of universal text embeddings has been further highlighted by the rise of Large Language Model (LLM) applications such as Retrieval-Augmented Systems (RAGs). While previous models have attempted to be general-purpose, they often struggle to generalize across tasks and domains. However, recent advances in the quantity, quality, and diversity of training data, in synthetic data generation from LLMs, and in the use of LLMs as backbones have driven great improvements in the pursuit of universal text embeddings. In this paper, we provide an overview of recent advances in universal text embedding models, with a focus on the top-performing text embeddings on the Massive Text Embedding Benchmark (MTEB). Through detailed comparison and analysis, we highlight the key contributions and limitations in this area and propose potentially inspiring future research directions.

Paper Structure (40 sections, 13 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The four eras of text embeddings. 1st era: Count-based embeddings (with dimension reduction techniques); 2nd era: Static dense word embeddings; 3rd era: Contextualized embeddings; 4th era: Universal text embeddings.
  • Figure 2: Representative state-of-the-art universal text embeddings and their main focus/contributions.
  • Figure 3: Cosine function's saturation zones exhibit near-zero gradients, which makes it difficult for the model to learn during backpropagation.
  • Figure 4: The top-performing text embeddings on the MTEB benchmark. The x-axis is the average performance over the 56 MTEB benchmark datasets; the y-axis is the log of the number of model parameters (in millions). Different colors indicate different embedding dimensions, and different shapes indicate different maximum token sizes.
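
The saturation effect described in the Figure 3 caption can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes the quantity of interest is the derivative of cos(θ) with respect to the angle θ between two embedding vectors, which is -sin(θ). The helper name `cosine_and_gradient` and the sample angles are hypothetical choices made for this example.

```python
import math


def cosine_and_gradient(theta):
    """Return cos(theta) and its derivative with respect to theta, -sin(theta)."""
    return math.cos(theta), -math.sin(theta)


# Hypothetical sample angles (radians) between two embedding vectors:
# near 0 (almost identical), pi/2 (orthogonal), near pi (almost opposite).
angles = [0.01, math.pi / 2, math.pi - 0.01]
results = [cosine_and_gradient(theta) for theta in angles]

for theta, (value, grad) in zip(angles, results):
    print(f"theta={theta:.2f}  cos={value:+.4f}  d(cos)/d(theta)={grad:+.4f}")
```

Under this assumption, the derivative is close to zero when the angle is near 0 or π, where the cosine is near +1 or -1, and largest in magnitude near π/2. This matches the caption's point that little gradient signal is available in the saturation zones.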