An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding

Tong Wu, Yanpeng Zhao, Zilong Zheng

TL;DR

A truncated Gaussian is introduced to encourage sampling from the middle part of the context during fine-tuning, thus alleviating the "Lost-in-the-Middle" problem faced by long-context LLMs.

Abstract

Recently, many methods have been developed to extend the context length of pre-trained large language models (LLMs), but they often require fine-tuning at the target length ($\gg4K$) and struggle to effectively utilize information from the middle part of the context. To address these issues, we propose $\textbf{C}$ontinuity-$\textbf{R}$elativity ind$\textbf{E}$xing with g$\textbf{A}$ussian $\textbf{M}$iddle ($\texttt{CREAM}$), which interpolates positional encodings by manipulating position indices. Apart from being simple, $\texttt{CREAM}$ is training-efficient: it only requires fine-tuning at the pre-trained context window (e.g., Llama 2-4K) and can extend LLMs to a much longer target context length (e.g., 256K). To ensure that the model focuses more on the information in the middle, we introduce a truncated Gaussian to encourage sampling from the middle part of the context during fine-tuning, thus alleviating the "Lost-in-the-Middle" problem faced by long-context LLMs. Experimental results show that $\texttt{CREAM}$ successfully extends LLMs to the target length for both Base and Chat versions of $\texttt{Llama2-7B}$ with "Never Miss A Beat". Our code is publicly available at https://github.com/bigai-nlco/cream.
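
The middle-focused sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rejection-sampling approach, and the `std_fraction` parameter are assumptions made here for illustration, and only the idea of drawing positions from a Gaussian truncated to the context range comes from the text above.

```python
import random


def sample_truncated_gaussian(low, high, mean, std, rng):
    """Draw one value from N(mean, std^2) restricted to [low, high].

    Uses simple rejection sampling (an assumption of this sketch):
    redraw until the value falls inside the interval.
    """
    if not low <= mean <= high:
        raise ValueError("mean must lie inside [low, high]")
    while True:
        x = rng.gauss(mean, std)
        if low <= x <= high:
            return x


def sample_middle_position(target_len, rng, std_fraction=0.15):
    """Sample a position index in [0, target_len - 1], biased toward the middle.

    std_fraction is a hypothetical knob: the standard deviation expressed
    as a fraction of the target context length.
    """
    mean = (target_len - 1) / 2
    std = std_fraction * target_len
    return round(sample_truncated_gaussian(0, target_len - 1, mean, std, rng))


TARGET_LEN = 256 * 1024  # e.g. a 256K target context
rng = random.Random(0)   # seeded for reproducibility
samples = [sample_middle_position(TARGET_LEN, rng) for _ in range(10_000)]

# Fraction of samples that land in the middle third of the context.
middle_fraction = sum(
    1 for s in samples if TARGET_LEN / 3 <= s < 2 * TARGET_LEN / 3
) / len(samples)
print(f"share of samples in the middle third: {middle_fraction:.2f}")
```

Under uniform sampling roughly one third of the draws would fall in the middle third; with the truncated Gaussian the share is noticeably higher, which is the bias the method relies on.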

Paper Structure (46 sections, 3 theorems, 19 equations, 10 figures, 10 tables, 1 algorithm)


Key Result

Theorem B.1

If $N \ll L$, the spanning size $|D_r|$ of the relative position union in eq:relative_set reaches its maximum iff one of two groups of inequalities is satisfied, where $\max|D_r| = \max(L_h-1, P_e - P_s, L_t-1) + 2N$.

Figures (10)

  • Figure 1: Results of applying different position interpolation methods to the "Lost-in-the-Middle" task with CREAM and PoSE. CREAM outperforms PoSE at every position, particularly in the middle.
  • Figure 2: Illustration of CREAM position interpolation. The pre-trained context window is divided into three segments: the head, middle, and tail. To ensure continuity, we fix the lengths of the head and tail to a small value $k$. To maintain relativity, we set the lengths of the head and tail to $N/3$. For the middle part, the start and end position indices are determined via truncated Gaussian sampling, thereby encouraging the model to pay more attention to the information in the middle part.
  • Figure 3: Results (%) on LongChat-Lines. Each length consists of 50 samples. All models are fine-tuned from Llama-2-7B on 4K-length data with linear position interpolation. Refer to \ref{app:longchat_lines} for ablated results using NTK and YaRN.
  • Figure 4: Results on Needle-in-a-Haystack. $^\dag$ indicates results excerpted from SkipAlign. Both models are instruction-tuned from Llama-2-7B-Chat on 4K-length data. The color changes gradually from deep green to deep red as Recall performance decreases from 10 to 1.
  • Figure 5: Ablation study of CREAM on LongChat-Lines. The result at each length is estimated using 50 samples.
  • ...and 5 more figures
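
The segment layout described in the Figure 2 caption can be sketched as follows. This is a hedged illustration rather than the paper's algorithm: the function name `build_position_ids`, the contiguous middle segment, and the clamping rule are assumptions of this sketch. Only the head/middle/tail split of the pre-trained window and the two choices of head and tail length (a small $k$, or $N/3$) come from the caption.

```python
def build_position_ids(window, target, head_len, tail_len, middle_center):
    """Map `window` training tokens onto position indices in [0, target).

    Head tokens keep the first indices, tail tokens take the last indices
    of the target range, and the remaining tokens form a contiguous middle
    segment centred near `middle_center` (for example, a value drawn from
    a truncated Gaussian). All names here are hypothetical.
    """
    middle_len = window - head_len - tail_len
    if middle_len <= 0:
        raise ValueError("head and tail must leave room for a middle segment")
    if target < window:
        raise ValueError("target length must be at least the window length")

    earliest_start = head_len                       # right after the head
    latest_start = target - tail_len - middle_len   # ends right before the tail
    start = middle_center - middle_len // 2
    start = min(max(start, earliest_start), latest_start)  # keep segments disjoint

    head = list(range(head_len))
    middle = list(range(start, start + middle_len))
    tail = list(range(target - tail_len, target))
    return head + middle + tail


# Toy sizes: window N = 12, target L = 48, head and tail of length N/3 = 4.
ids = build_position_ids(window=12, target=48, head_len=4, tail_len=4,
                         middle_center=24)
print(ids)  # [0, 1, 2, 3, 22, 23, 24, 25, 44, 45, 46, 47]
```

Passing a small `head_len` and `tail_len` corresponds to the continuity setting in the caption, while `window // 3` for both corresponds to the relativity setting; in either case the model sees only `window` tokens but is exposed to indices spanning the full target range.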

Theorems & Definitions (4)

  • Definition 2.1
  • Theorem B.1
  • Lemma B.2
  • Theorem B.3