Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin
TL;DR
The paper addresses a weakness of decoder-only LLMs as text embedders: the final-token embedding, typically that of the [EOS] token, is not trained to capture the semantics of the full context. It introduces a two-stage training paradigm. Stage I uses bidirectional reconstruction tasks (EBQ2D and EBD2Q) to anchor the [EOS] representation and inject semantic alignment; Stage II applies contrastive learning to refine the embedding space. On MTEB, the method yields consistent gains across multiple base models (1B–8B) and sets a new state of the art using only public data in zero-shot settings, and ablations confirm the contribution of both stages and of the reconstruction objectives. The approach also converges faster early in training and is robust across tasks, suggesting practical value for retrieval, reranking, and related embedding tasks.
Abstract
Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
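To make the two ingredients named above concrete, the following is a minimal, dependency-free sketch of (a) taking the final-token hidden state as the text embedding and (b) a contrastive objective over query-document pairs. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the contrastive loss, so an in-batch InfoNCE loss with temperature 0.05 is assumed here, right padding is assumed for pooling, and the function names `last_token_pool`, `normalize`, and `info_nce` are hypothetical. The Stage I reconstruction tasks (EBQ2D and EBD2Q) require a generative LLM and are not sketched.

```python
import math


def last_token_pool(hidden, mask):
    """Return the hidden state of the last real token of each sequence.

    hidden: list of sequences, each a list of token vectors.
    mask:   list of 0/1 lists, 1 marking real tokens (right padding assumed).
    """
    pooled = []
    for states, m in zip(hidden, mask):
        last = sum(m) - 1  # index of the final non-padding token, e.g. [EOS]
        pooled.append(states[last])
    return pooled


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, docs, temperature=0.05):
    """In-batch InfoNCE loss (an assumed choice of contrastive objective).

    queries[i] and docs[i] form a positive pair; every other document in the
    batch acts as a negative for queries[i].
    """
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the positive
    return total / len(q)
```

In this sketch the loss is close to zero when each query embedding matches its own document and grows when the pairs are mismatched, which is the pressure that a contrastive stage places on the final-token embedding.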
