LLMs are Also Effective Embedding Models: An In-depth Overview

Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Kai Hua, Wenpeng Hu, Zhengwei Tao, Shuai Ma

TL;DR

The paper surveys the shift from encoder-centric embeddings to decoder-only LLM-based embeddings for NLP and information retrieval tasks. It analyzes two core paradigms, tuning-free direct prompting and fine-tuning, and for the latter covers model architecture, training objectives, and data construction. It further reviews specialized embedding settings (multilingual, code, long context, cross-modal, reasoning-aware, domain-specific) and discusses data synthesis, hard-negative mining, and scaling considerations. It also highlights open challenges such as efficiency-accuracy trade-offs, bias, robustness, and long-context handling, and offers a structured roadmap for future research.

Abstract

Large language models (LLMs) have revolutionized natural language processing by achieving state-of-the-art performance across various tasks. Recently, their effectiveness as embedding models has gained attention, marking a paradigm shift from traditional encoder-only models like ELMo and BERT to decoder-only, large-scale LLMs such as GPT, LLaMA, and Mistral. This survey provides an in-depth overview of this transition, beginning with foundational techniques before the LLM era, followed by LLM-based embedding models through two main strategies to derive embeddings from LLMs. 1) Direct prompting: We mainly discuss the prompt designs and the underlying rationale for deriving competitive embeddings. 2) Data-centric tuning: We cover extensive aspects that affect tuning an embedding model, including model architecture, training objectives, data constructions, etc. Upon the above, we also cover advanced methods for producing embeddings from longer texts, multilingual, code, cross-modal data, as well as reasoning-aware and other domain-specific scenarios. Furthermore, we discuss factors affecting choices of embedding models, such as performance/efficiency comparisons, dense vs sparse embeddings, pooling strategies, and scaling law. Lastly, the survey highlights the limitations and challenges in adapting LLMs for embeddings, including cross-task embedding quality, trade-offs between efficiency and accuracy, low-resource, long-context, data bias, robustness, etc. This survey serves as a valuable resource for researchers and practitioners by synthesizing current advancements, highlighting key challenges, and offering a comprehensive framework for future work aimed at enhancing the effectiveness and efficiency of LLMs as embedding models.

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An overview of our survey on deploying LLMs as embedding models, covering ① direct prompting as embedding models, ② fine-tuning as embedding models, and ③ specialized embedding models.
  • Figure 2: Prompts used in various tuning-free methods.
  • Figure 3: An illustration of the evolution of text embeddings from ELMo and BERT to GPT-style models, from zaib2020short.
  • Figure 4: An overview of contextual expansion for embedding models. RMs denotes retrieval models.
  • Figure 5: An illustration of Matryoshka Embedding, from kusupati2022matryoshka.
  • ...and 2 more figures