LongEmbed: Extending Embedding Models for Long Context Retrieval

Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li

TL;DR

This work tackles the short context windows of embedding models by applying training-free methods that extend context length up to 32k tokens, evaluated on the LongEmbed benchmark, which combines synthetic and real-world long-context tasks. It shows that RoPE-based models benefit more from context window extension than APE-based ones, with NTK-Aware Interpolation and SelfExtend delivering the strongest gains, including dramatic improvements on 32k-token tasks. The paper also demonstrates that, with targeted extension strategies, existing embedding models can retrieve over far longer inputs without retraining, and it releases new models and a benchmark to accelerate research in long-context embedding.
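
As a rough illustration of NTK-Aware Interpolation, the sketch below uses the commonly cited formulation in which the RoPE base is enlarged by a factor of s^(d/(d-2)), where s is the extension ratio and d is the per-head dimension. The function names, the base of 10,000, the head dimension of 64, and the 4k-to-32k ratio are illustrative assumptions, not values taken from the paper.

```python
import math
from typing import List


def rope_inv_freq(head_dim: int, base: float = 10000.0) -> List[float]:
    """Per-pair rotation frequencies of RoPE: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """Enlarge the RoPE base so the slowest frequency is stretched by `scale`."""
    return base * scale ** (head_dim / (head_dim - 2))


# Hypothetical setting: extend a 4k window to 32k, so scale = 8.
head_dim = 64
scale = 32768 / 4096
original = rope_inv_freq(head_dim)
extended = rope_inv_freq(head_dim, ntk_scaled_base(10000.0, scale, head_dim))

# The fastest-rotating pair is unchanged; the slowest one is slowed by `scale`.
print(original[0], extended[0])
print(original[-1] / extended[-1])
```

Under this scaling, high-frequency dimensions stay close to their trained values while low-frequency dimensions are interpolated, which is why no single uniform compression of position ids is applied.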

Abstract

Embedding models play a pivotal role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, which keeps them out of application scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold, regardless of whether their original context is 512 tokens or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
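
To make position interpolation concrete for an APE-based model, the following minimal sketch maps the position ids of a longer input back into the trained range and blends the two neighbouring learned position vectors. The helper names, the toy 2-dimensional table, and the clamping of the last fractional id are assumptions made for illustration; the paper's implementation may differ.

```python
from typing import List


def interpolated_position_ids(target_len: int, original_len: int) -> List[float]:
    """Scale target positions 0..target_len-1 into the trained range."""
    scale = original_len / target_len
    return [i * scale for i in range(target_len)]


def lookup_interpolated(table: List[List[float]], pid: float) -> List[float]:
    """Linearly blend the two learned position vectors surrounding `pid`."""
    lo = int(pid)                      # floor, valid because pid >= 0
    hi = min(lo + 1, len(table) - 1)   # assumption: clamp at the last trained row
    w = pid - lo
    return [(1 - w) * a + w * b for a, b in zip(table[lo], table[hi])]


# Toy example: a table trained for 4 positions, reused for 8 positions.
toy_table = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]
pids = interpolated_position_ids(target_len=8, original_len=4)
vectors = [lookup_interpolated(toy_table, p) for p in pids]
print(pids)
print(vectors[1])
```

Because integer position ids reproduce the original rows exactly, inputs that fit in the trained window see the same position vectors as before.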

Paper Structure (20 sections, 2 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: (a) Overview of the LongEmbed benchmark. (b) Performance of current embedding models on passkey retrieval, with evaluation length ranging from 256 to 32,768. $\blacktriangle$ / $\blacklozenge$ denote embedding models with 512 / $\ge$ 4k context. The greener a cell is, the higher the retrieval accuracy the model achieves at the corresponding evaluation length. (c) Effects of context window extension methods on E5, E5-RoPE, and E5-Mistral, measured by improvements in Avg. Scores on LongEmbed. SE / NTK is short for SelfExtend / NTK-Aware Interpolation.
  • Figure 2: Results of E5-Base on the 8 publicly available LoCo tasks.
  • Figure 3: Example for the passkey and needle test. For the passkey test, the <prefix / suffix> are repeats of "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again." For the needle test, the <prefix> and <suffix> form a long essay.
  • Figure 4: (Left) Arrangement of pids for extending APE-based models from 512 to 1,024. (Right) Illustration of learnable and frozen position vectors when further tuning on RP / PI.
  • Figure 5: Effects of different context window extension methods on E5-Base and GTE-Base. We show that further tuning yields the best results.
  • ...and 2 more figures
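
A passkey-style test document like the one described for Figure 3 could be assembled roughly as follows. The filler sentence is the one quoted in the caption; the key sentence, the function name, and the repeat counts are hypothetical placeholders.

```python
# Filler sentence quoted in the Figure 3 caption.
FILLER = (
    "The grass is green. The sky is blue. The sun is yellow. "
    "Here we go. There and back again."
)


def build_passkey_document(key_sentence: str, n_prefix: int, n_suffix: int) -> str:
    """Surround `key_sentence` with repeated filler text on both sides."""
    prefix = " ".join([FILLER] * n_prefix)
    suffix = " ".join([FILLER] * n_suffix)
    # Skip empty parts so no stray spaces appear when a side has zero repeats.
    return " ".join(part for part in (prefix, key_sentence, suffix) if part)


# Hypothetical key sentence; the exact wording used in the paper is not shown here.
key = "The pass key for this example is 12345."
doc = build_passkey_document(key, n_prefix=3, n_suffix=2)
print(len(doc.split()), "whitespace-separated words")
```

Varying `n_prefix` and `n_suffix` changes both the total length and where the key sentence sits, which is how such a test can probe different evaluation lengths.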