On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang

TL;DR

The paper surveys the role of pretrained language models in general-purpose text embeddings, detailing how PLMs provide the backbone for embedding extraction, expressivity, and training strategies, while enabling advanced capabilities like multilingual and multimodal representations. It highlights a unified GPTE architecture based on PLM backbones and contrastive learning, discusses data synthesis and evolving benchmarks, and analyzes architecture choices and model scale. The review then outlines future directions, including integration with text ranking, safety and bias mitigation, structure-aware learning, and reasoning-enhanced embeddings. Overall, the work clarifies how PLMs drive GPTE progress and points to practical avenues for scalable, adaptable, and responsible embedding systems with broad NLP and IR impact.

Abstract

Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, including retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. We then describe advanced roles enabled by PLMs, including multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
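The pipeline the abstract describes (pool a PLM's token states into one dense vector, then train with contrastive learning on text pairs) can be illustrated with a small sketch. This is not the survey's own formulation: the abstract only says "contrastive learning", so the in-batch InfoNCE form, the mean-pooling choice, the function names `mean_pool` and `info_nce_loss`, and the temperature of 0.05 are all assumptions made for illustration. Plain Python lists stand in for the hidden states a real PLM would produce.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one text, skipping padded positions.

    Assumes at least one position is unmasked.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """In-batch contrastive loss (assumed InfoNCE form).

    doc_embs[i] is the positive for query_embs[i]; every other
    document in the batch serves as a negative.
    """
    total = 0.0
    for i, q in enumerate(query_embs):
        logits = [cosine(q, d) / temperature for d in doc_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]  # negative log-softmax of the positive
    return total / len(query_embs)
```

In this sketch the loss is small when each query is closest to its own paired document and grows when the pairs are misaligned, which is the signal contrastive training uses to shape the embedding space.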

Paper Structure

This paper contains 38 sections, 1 equation, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: Taxonomy of PLMs' Roles in GPTE.
  • Figure 2: Four typical applications of text embedding.
  • Figure 3: The typical architecture and training manner of GPTE models.
  • Figure 4: Comparisons of GPTE models with various PLM backbones, focusing on those of widely-adopted open-source PLMs.