SELF: Self-Extend the Context Length With Logistic Growth Function

Phat Thanh Dang, Saahil Thoppay, Wang Yang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

TL;DR

This work tackles the deterioration of long-context reasoning in LLMs caused by relative-position encoding limitations by proposing SELF, a dynamic token-grouping scheme guided by a logistic growth function. SELF blends neighbor attention for nearby tokens with gradually expanding group sizes for distant tokens, enabling longer effective context without retraining. The authors provide a concrete formulation for the logistic grouping, an efficient parallel implementation, and empirical results showing perplexity and long-context task performance improvements across multiple models and benchmarks, notably Llama-2-7B and Qwen-7B on LEval and LongBench. While benefits are substantial in many settings, some models exhibit variability, highlighting the importance of model-specific behavior and computational trade-offs in applying SELF. Overall, SELF offers a practical pathway to extend context lengths while preserving short-context performance, with direct implications for scalability of long-context reasoning in real-world applications.

Abstract

Large language models degrade when run on contexts longer than their training context length, owing to the standard position encoding applied to tokens in the attention layer. Tokens that are far apart rarely influence each other, and long prompts yield unexpected results. To address this, we propose SELF (Self-Extend the Context Length With Logistic Growth Function), which groups consecutive tokens into groups of varying size using a logistic capacity equation, combined with a constant group size at smaller relative distances. On LEval, our method improves performance by up to 12% over the LongLM extension method (specifically on the Qwen model). On summarization-related tasks in LongBench, it performs up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, it performs up to 5.4% better than LongLM. Our code is available at https://github.com/alexeipc/SELF-LLM.

Paper Structure

This paper contains 18 sections, 12 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of our method for extending context length. Given a sequence of length $n$ that exceeds the training sequence length, the model groups consecutive tokens into groups whose sizes are determined by a function, with the help of the neighbor window. As a result, the greatest index becomes $m < n$, and the whole sequence falls within the model's scope.
  • Figure 2: Illustration of the relation between neighbor window size and perplexity after applying Self-Extend (Jin et al., 2024). The results are derived from testing Llama-2-7B and its Self-Extend variants on the first book in PG19 (Rae et al., 2019) with sequences of 2048 tokens. As the neighbor window size increases, the perplexity of the Self-Extend models slowly approaches that of the original model.
  • Figure 3: Illustration of the relation between $G^K$ and $G^Q$, given that the relative position right after the neighbor window has to be $W$.
  • Figure 4: Illustration of the algorithm grouping the indices using the function $f:\mathbb{N}\rightarrow \mathbb{N}$, where $f(0)=1$, $f(1)=2$, $f(2)=2$, $f(3)=3$ and $f(4)=3$. A sequence of length $n=11$ is run through a model with pretraining sequence length $L=6$. The numbers denote the relative position between the corresponding key and query tokens. As in Self-Extend (Jin et al., 2024), there are two kinds of self-attention: neighbor tokens inside the neighbor window ($W=3$, blue cells in the figure) use regular self-attention; group tokens outside the neighbor window (orange cells in the figure) use group self-attention (group indices are shown as the $G$ row and column in the figure). A green $G^Q$ entry can take any value, as it is completely covered by the neighbor window.
  • Figure 5: Comparison of the trade-off across different group sizes for SE and SELF on 2WikiMultihopQA. The two grouping methods have the same neighbor window size $W=1024$.
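
The grouping described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper name `group_indices` is hypothetical, the sketch assumes that group $i$ simply holds $f(i)$ consecutive tokens counted from the start of the sequence, and it does not model the neighbor window or the offset between $G^K$ and $G^Q$. The values of $f$, $n$ and $L$ are the ones given in the caption.

```python
from typing import Callable, List


def group_indices(n: int, group_size: Callable[[int], int]) -> List[int]:
    """Assign a group index to each of n positions.

    Assumption (not confirmed by the source): group i covers
    group_size(i) consecutive tokens, filled from position 0 upward.
    """
    indices: List[int] = []
    g = 0
    while len(indices) < n:
        size = group_size(g)
        # The last group is truncated if it would run past the sequence end.
        indices.extend([g] * min(size, n - len(indices)))
        g += 1
    return indices


# Group sizes taken from the Figure 4 caption.
f = {0: 1, 1: 2, 2: 2, 3: 3, 4: 3}

n = 11  # sequence length in the caption
L = 6   # pretraining sequence length in the caption

groups = group_indices(n, lambda i: f[i])
print(groups)       # [0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
print(max(groups))  # 4, which stays below L = 6
```

Under these assumptions the 11 positions collapse to group indices 0 through 4, so the greatest index stays below the pretraining length $L=6$, which matches the idea in Figure 1 that the greatest index becomes $m < n$.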