Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs

Xin Ma, Yang Liu, Jingjing Liu, Xiaoxu Ma

TL;DR

The paper shows that LLMs equipped with weave PE can achieve improved extrapolation performance without additional cost, and introduces Mesa-Extrapolation, a novel weave PE method that uses a chunk-based triangular attention matrix and applies Stair PE to manage the final chunk.

Abstract

Large language models (LLMs), although they have revolutionized many fields, still suffer from the challenging extrapolation problem: their inference ability declines sharply beyond their maximum training length. In this work, we conduct a theoretical analysis to better understand why No Position Encoding (NoPE) fails outside its effective range, and to examine the power of Position Encoding (PE) in this context. Our findings reveal that, with a meticulous weave position, PE can indeed be extended beyond its effective range. Our theorems establish that LLMs equipped with weave PE can achieve improved extrapolation performance without additional cost. Furthermore, we introduce a novel weave PE method, Mesa-Extrapolation, which utilizes a chunk-based triangular attention matrix and applies Stair PE to manage the final chunk. This method not only retains competitive performance but also offers substantial benefits such as significantly reduced memory demand and faster inference speed. Extensive experiments validate the effectiveness of Mesa-Extrapolation, demonstrating its potential as a scalable solution for extending the applicative reach of LLMs. Our code is available at https://github.com/soacker/Mesa-Extrapolation.
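
To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. This summary does not give the exact definitions, so both functions are assumptions: `stair_position` assumes that relative distances are kept exact up to a hypothetical `plateau_start` and then advance by one index every `stair_width` tokens, and `chunk_triangular_mask` assumes that every chunk is causal within itself while only the final chunk attends to all preceding tokens. All function and parameter names are illustrative.

```python
from typing import List


def stair_position(distance: int, plateau_start: int, stair_width: int) -> int:
    """Map a true relative distance to a stair-like position index.

    Assumed form (not taken from the paper text shown here): distances
    below `plateau_start` are unchanged; beyond it, the index grows by
    one for every `stair_width` additional tokens.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if stair_width < 1:
        raise ValueError("stair_width must be at least 1")
    if distance < plateau_start:
        return distance
    return plateau_start + (distance - plateau_start) // stair_width


def chunk_triangular_mask(seq_len: int, chunk_size: int) -> List[List[bool]]:
    """Build a boolean attention mask; mask[q][k] is True if q may attend to k.

    Assumed layout: each chunk is lower-triangular (causal) within itself,
    and the final chunk additionally attends to every earlier token.
    """
    if seq_len < 1 or chunk_size < 1:
        raise ValueError("seq_len and chunk_size must be positive")
    last_chunk = (seq_len - 1) // chunk_size
    mask = []
    for q in range(seq_len):
        q_chunk = q // chunk_size
        row = []
        for k in range(seq_len):
            causal = k <= q
            same_chunk = (k // chunk_size) == q_chunk
            row.append(causal and (same_chunk or q_chunk == last_chunk))
        mask.append(row)
    return mask
```

For example, with `plateau_start=4` and `stair_width=2`, distances 4 and 5 share index 4, while distance 6 maps to index 5. The intent of such a mapping is to keep the position indices seen by the final chunk within a bounded range even when the true distance is large.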

Paper Structure

This paper contains 44 sections, 8 theorems, 78 equations, 17 figures, 4 tables, 2 algorithms.

Key Result

Theorem 3.1

Let $x = [\langle \mathrm{bos} \rangle, x_1, \ldots, x_T]$ be an input sequence of length $T+1$ to the model. Then there exist matrices $\bm{W}_Q$, $\bm{W}_K$, $\bm{W}_V$, $\bm{W}_O$, $\bm{W}_1$, and $\bm{W}_2$ such that when $T < M$, $o_T > \mathcal{H}$; and when $T > M$, $o_T < \mathcal{H}$.

Figures (17)

  • Figure 1: Chunk-based triangular attention matrix, PE and Stair PE. The left figure shows the chunk-based triangular attention matrix (before the SoftMax operation) of Mesa-Extrapolation when an exemplar sequence of length $13$ is fed into an LLM. The right figure shows an example of PE and Stair PE. The Stair PE is used to weave the relative position in Mesa-Extrapolation.
  • Figure 2: Thresholds for hidden states observed at specific dimensions on LLaMA2-7B-Chat, allowing for extrapolative judgments based on these thresholds. The vertical black dashed line indicates the maximum training length of the model, which is 4k for LLaMA2-7B-Chat. The hidden state value at this position is designated as the observed threshold and marked with a horizontal red dashed line. When the hidden state value exceeds the red dashed line as the position changes, it has surpassed the threshold, suggesting a failure in extrapolation after that position.
  • Figure 3: Passkey Retrieval Accuracy for different methods on various LLMs. The X-axis represents the input token length, and the Y-axis represents the accuracy with which the LLMs retrieve the passkey. Different color regions denote the variance, averaged over $100$ samples for each input token length. The black dashed line represents the maximum training length of the LLMs. Some observations: Weave PE-based methods, including ReRoPE, Leaky-ReRoPE, and Mesa-Extrapolation, consistently demonstrate stable extrapolation capabilities even when the input length surpasses the maximum training length. We attribute the "early stopping" phenomenon in certain methods to GPU memory exhaustion under our existing hardware resources.
  • Figure 4: Perplexity (PPL) metrics on LLaMA models using the Pile dataset. Some observations: (1) The PPL value of Origin consistently increases when the maximum training length is exceeded. (2) Other methods maintain low PPL values, with Dynamic-NTK exhibiting a slight increase as the input length grows.
  • Figure 5: Memory Usage and Decoding Speed Comparison for LLaMA Models: 3B and 7B. The X-axis represents the input token length, the left Y-axis denotes memory usage, and the right Y-axis indicates decoding time during inference. Some observations: (1) ReRoPE and Leaky-ReRoPE exhibit the largest memory footprint for the same input length, and their decoding time follows a quadratic trend. (2) Mesa-Extrapolation's decoding time grows approximately linearly, giving it the fastest inference speed and the smallest memory usage under the same input conditions.
  • ...and 12 more figures
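
The threshold rule described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative reading of the caption, not code from the paper: the function name, the 0-indexed position convention, and the choice to take the threshold at index `max_train_len - 1` are all assumptions.

```python
from typing import Optional, Sequence


def first_threshold_crossing(
    hidden_values: Sequence[float], max_train_len: int
) -> Optional[int]:
    """Return the first position past the training length whose hidden-state
    value exceeds the observed threshold, or None if no position does.

    hidden_values[p] is the hidden-state value of one chosen dimension at
    position p (0-indexed, an assumed convention). The threshold is the
    value observed at the maximum training length.
    """
    if not 0 < max_train_len <= len(hidden_values):
        raise ValueError("max_train_len must lie within the sequence")
    threshold = hidden_values[max_train_len - 1]
    for pos in range(max_train_len, len(hidden_values)):
        if hidden_values[pos] > threshold:
            return pos  # extrapolation is suspected to fail from here on
    return None
```

A return value of `None` means the observed dimension never exceeds the threshold over the tested range, which under this reading gives no sign of extrapolation failure.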

Theorems & Definitions (12)

  • Theorem 3.1: NoPE Extrapolation
  • Theorem 3.2: PE Extrapolation
  • Theorem 3.3: Weave PE Extrapolation
  • Corollary 4.1: Mesa Extrapolation
  • Theorem E.1: NoPE Extrapolation
  • proof
  • Theorem E.2: PE Extrapolation
  • proof
  • Theorem E.3: Weave PE Extrapolation
  • proof
  • ...and 2 more