
Layer-Specific Scaling of Positional Encodings for Superior Long-Context Modeling

Zhenghua Wang, Yiran Ding, Changze Lv, Zhibo Xu, Tianlong Li, Tianyuan Shi, Xiaoqing Zheng, Xuanjing Huang

TL;DR

The paper tackles the lost-in-the-middle problem in long-context LLMs, which the authors attribute to the rapid long-term decay of RoPE. It introduces a layer-specific RoPE scaling method in which the per-layer factors follow a Bézier-curve parameterization and are optimized with a genetic algorithm, so the search runs over a few control points rather than over every layer independently. Experiments on multiple 7B-class models show up to +20% average accuracy on Key-Value Retrieval and better extrapolation on PG19, without adding inference latency. The work also offers insight into how attention is distributed across layers and a scalable way to improve long-context modeling across LLM architectures.

Abstract

Although large language models (LLMs) have achieved significant progress in handling long-context inputs, they still suffer from the "lost-in-the-middle" problem, where crucial information in the middle of the context is often underrepresented or lost. Our extensive experiments reveal that this issue may arise from the rapid long-term decay in Rotary Position Embedding (RoPE). To address this problem, we propose a layer-specific positional encoding scaling method that assigns distinct scaling factors to each layer, slowing down the decay rate caused by RoPE to make the model pay more attention to the middle context. A specially designed genetic algorithm is employed to efficiently select the optimal scaling factors for each layer by incorporating Bézier curves to reduce the search space. Through comprehensive experimentation, we demonstrate that our method significantly alleviates the "lost-in-the-middle" problem. Our approach results in an average accuracy improvement of up to 20% on the Key-Value Retrieval dataset. Furthermore, we show that layer-specific interpolation, as opposed to uniform interpolation across all layers, enhances the model's extrapolation capabilities when combined with PI and Dynamic-NTK positional encoding schemes.
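To make the search-space reduction concrete, here is a minimal sketch of how a handful of Bézier control values could be expanded into one scaling factor per layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `bezier_layer_factors`, the one-dimensional Bernstein form, the control values, and the 32-layer depth are all hypothetical choices made for this example.

```python
from math import comb


def bezier_layer_factors(control_values, num_layers):
    """Sample a 1-D Bezier (Bernstein) curve at evenly spaced layer positions.

    control_values: hypothetical control values; a search procedure would tune
                    these few numbers instead of one factor per layer.
    num_layers:     number of transformer layers to assign factors to.
    """
    n = len(control_values) - 1  # degree of the curve
    factors = []
    for layer in range(num_layers):
        # Map the layer index to the curve parameter t in [0, 1].
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        # Bernstein-basis evaluation of the curve at t.
        value = sum(
            comb(n, i) * (1 - t) ** (n - i) * t ** i * c
            for i, c in enumerate(control_values)
        )
        factors.append(value)
    return factors


# Assumed example: 4 control values describe the factors of all 32 layers.
factors = bezier_layer_factors([1.0, 2.0, 1.5, 1.0], num_layers=32)
```

With this kind of parameterization, a genetic algorithm only has to evolve the short list of control values; the smooth curve then determines every layer's factor, which is the sense in which the curve constrains the search space.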

Paper Structure

This paper contains 18 sections, 14 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Average accuracy on MDQA (a) and the Key-Value Retrieval (b) datasets. By applying layer-specific scaling to enhance middle-context attention, we achieved an average accuracy improvement of +$20\%$ on the Key-Value Retrieval dataset and +$2.7\%$ on the MDQA dataset.
  • Figure 2: Modeling with Bézier curves. First, a constrained genetic algorithm moves the initial control points to their optimal positions. Then, we apply scaling factors derived from the fitted curve to each layer. The bottom part of the figure illustrates the model structure, alongside a comparison between the scaled attention and the normal attention.
  • Figure 3: Cosine similarity between the average of the first 128 tokens and the later tokens reveals the model's heightened focus on earlier content without positional encoding.
  • Figure 4: The rapid decay of RoPE biases the model toward local focus, while the scaling operation can slow the decay, contributing to the enhancement of global attention capacity.
  • Figure 5: Cosine similarity between the average of the middle context and the last token reveals the model's heightened focus on the middle context as the scaling factor increases.
  • ...and 5 more figures
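
The decay-and-scaling effect described in the Figure 4 caption can be probed with a small numerical sketch. This is a rough proxy rather than the paper's analysis: it assumes all-ones query and key vectors, the usual RoPE frequencies `base ** (-2j / head_dim)`, and a position-interpolation-style scaling that divides the relative distance by a factor. The function name `rope_relative_magnitude` and its defaults are hypothetical.

```python
import cmath


def rope_relative_magnitude(distance, head_dim=128, base=10000.0, scale=1.0):
    """Normalized magnitude of the RoPE-rotated dot product at a relative distance.

    Assumes all-ones query/key vectors, so the score reduces to a sum of
    complex exponentials, one per frequency pair. `scale` divides the distance
    (an assumed interpolation-style scaling), stretching the decay curve.
    """
    half = head_dim // 2
    total = sum(
        cmath.exp(1j * (distance / scale) * base ** (-2 * j / head_dim))
        for j in range(half)
    )
    return abs(total) / half


# Print the proxy at a few distances, unscaled and with an assumed factor of 4.
for d in (0, 16, 64, 256, 1024):
    unscaled = rope_relative_magnitude(d)
    scaled = rope_relative_magnitude(d, scale=4.0)
    print(f"distance={d:5d}  unscaled={unscaled:.3f}  scaled={scaled:.3f}")
```

Under this proxy, a scaled layer at distance `d` sees the same value an unscaled layer sees at `d / scale`, so the decay curve is stretched along the distance axis. Assigning a different `scale` to each layer is then one way to read the layer-specific scheme.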