DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search

Lei Yang, Shaoyang Xu, Jianxiang Peng, Shaolin Zhu, Deyi Xiong

TL;DR

DCIS extends the context length of LLMs by searching for RoPE rotation-frequency scaling factors with a Divide-and-Conquer Incremental Search, followed by fine-tuning on short contexts. The search runs at inference time and uses perplexity as its guide; the model is then fine-tuned with the resulting factors so that it generalizes to target lengths such as 64k tokens. Experiments on Llama2-7B, Llama3-8B, and Mistral-7B-v0.1 show less performance decay at long contexts and, in many cases, improvement even without fine-tuning. DCIS explores roughly half the search space of LongRoPE and enables faster convergence. The results hold across multiple models and context lengths, and they show that non-strictly increasing scaling factors can improve extrapolation while reducing fine-tuning costs.

Abstract

Large language models (LLMs) based on the Transformer architecture usually have a limited context length because of the high training cost. Recent work extends the context window by adjusting the scaling factors of RoPE and fine-tuning. However, suboptimal initialization of these factors increases fine-tuning costs and reduces performance at the target length. To address these challenges, we propose a novel RoPE-based fine-tuning framework that departs from conventional scaling-factor search. Specifically, we present a Divide-and-Conquer Incremental Search (DCIS) algorithm that strategically determines better scaling factors. Further fine-tuning with the identified scaling factors effectively extends the context window of LLMs. Empirical results demonstrate that our methodology not only mitigates performance decay at extended target lengths but also allows the model to fine-tune on short contexts and generalize to long contexts, thereby reducing the cost of fine-tuning. The scaling factors obtained through DCIS can even perform effectively without fine-tuning. Further analysis of the search space reveals that DCIS achieves twice the search efficiency of other methods. We also examine the impact of the non-strictly increasing scaling factors used in DCIS and evaluate the general capabilities of LLMs across various context lengths.
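
To make "scaling factors of RoPE" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the common convention in which each standard RoPE frequency base^(-2i/d) is divided by its own factor; the function names `rope_angles` and `apply_rope` and the default base of 10000 are illustrative choices, not taken from the paper.

```python
import math


def rope_angles(position, dim, scaling_factors=None, base=10000.0):
    """Rotation angles for one token position, one angle per dimension pair."""
    half = dim // 2
    if scaling_factors is None:
        scaling_factors = [1.0] * half  # no rescaling: plain RoPE
    if len(scaling_factors) != half:
        raise ValueError("expected one scaling factor per dimension pair")
    angles = []
    for i, lam in enumerate(scaling_factors):
        theta = base ** (-2.0 * i / dim)  # standard RoPE frequency
        # Assumed convention: the frequency is divided by its scaling factor.
        angles.append(position * theta / lam)
    return angles


def apply_rope(vec, angles):
    """Rotate each consecutive (x, y) pair of vec by the matching angle."""
    out = []
    for i, angle in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out


# Per-dimension factors: the higher-index (lower-frequency) pair is slowed 4x.
query = [1.0, 0.0, 0.0, 2.0]
rotated = apply_rope(query, rope_angles(3, 4, [1.0, 4.0]))
```

Under this convention, setting every factor to the same value s reduces to uniform position interpolation (position p behaves like position p/s). A search method such as DCIS instead looks for a separate factor per dimension pair.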

Paper Structure

This paper contains 23 sections, 8 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: The Llama2-7B model is extended to a 64k context window. We test the PPL using 10 Proof-pile samples with a minimum length of 128k tokens. Fine-tuning is performed using the default method described in Section \ref{sec:exper}. The suffix "-64k" or "-16k" indicates that a model fine-tuned at a length of 64k or 16k generalizes to a target length of 64k.
  • Figure 2: Diagram of the proposed DCIS framework. We illustrate the search procedure when $d=16$, with 3 incremental values selected at each processing step. Since the scaling factors are divided into a high-frequency part and a low-frequency part, we initially process them as two segments. First, DCIS searches the scaling factors for the last 4 positions (i.e., $\lambda_4$ to $\lambda_7$) and draws 3 incremental values (i.e., $v_i$) within the range $[l_0,r_0]$. It then computes the PPL (i.e., $p_i$) for each incremental value. Finally, it selects the incremental value with the lowest PPL to update these 4 scaling factors. DCIS then processes the 4 scaling factors in the first segment in the same manner. At the second layer, DCIS processes 2 scaling factors at a time, at the third layer 1 scaling factor at a time, and so on. The search then returns the resulting scaling factors.
  • Figure 3: The recall rate of passkey with different lengths. Higher values indicate better performance.
  • Figure 4: PPL across different models, lengths, and datasets without fine-tuning.
  • Figure 5: Exploration of Adaptive Scaling Factor (ASF).
  • ...and 3 more figures
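
The search procedure described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: candidate increments are evenly spaced (the caption does not say how they are chosen), the range `[lo, hi]` is kept fixed across layers, the number of factors is assumed to be a power of two, and `ppl_fn` is a hypothetical stand-in for evaluating model perplexity with a given list of scaling factors.

```python
def evenly_spaced(lo, hi, count):
    """Return `count` evenly spaced values in [lo, hi] (an assumed choice)."""
    if count == 1:
        return [(lo + hi) / 2.0]
    step = (hi - lo) / (count - 1)
    return [lo + k * step for k in range(count)]


def dcis_search(initial_factors, ppl_fn, lo, hi, num_candidates=3):
    """Divide-and-conquer incremental search over scaling factors (sketch)."""
    factors = list(initial_factors)
    n = len(factors)
    segment = n // 2  # first layer: two halves (high and low frequency)
    while segment >= 1:
        # Visit segments from the last (low-frequency) one to the first.
        for start in range(n - segment, -1, -segment):
            best_ppl, best_factors = None, None
            for inc in evenly_spaced(lo, hi, num_candidates):
                trial = list(factors)
                for j in range(start, start + segment):
                    trial[j] += inc  # same increment for the whole segment
                ppl = ppl_fn(trial)
                if best_ppl is None or ppl < best_ppl:
                    best_ppl, best_factors = ppl, trial
            factors = best_factors  # keep the lowest-PPL candidate
        segment //= 2  # next layer: halve the segment size
    return factors


# Toy stand-in for perplexity: squared distance to a made-up optimum.
target = [1, 1, 1, 1, 3, 3, 3, 3]
toy_ppl = lambda fs: sum((f - t) ** 2 for f, t in zip(fs, target))
found = dcis_search([1.0] * 8, toy_ppl, lo=0.0, hi=2.0)
```

With 8 factors (the $d=16$ case), the layers handle segments of 4, 2, and 1 factors. Because each segment receives its own increment, nothing in this sketch forces the resulting factors to increase monotonically with the dimension index.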