DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search
Lei Yang, Shaoyang Xu, Jianxiang Peng, Shaolin Zhu, Deyi Xiong
TL;DR
DCIS extends LLM context length by searching for RoPE rotation-frequency scaling factors with a Divide-and-Conquer Incremental Search and then fine-tuning on short contexts. The search runs at inference time with perplexity as its guide; the model is then fine-tuned with the resulting factors so that it generalizes to target lengths such as 64k tokens. Empirical results on Llama2-7B, Llama3-8B, and Mistral-7B-v0.1 show reduced performance decay at long contexts and, in many cases, improvement even without fine-tuning. DCIS explores roughly half the search space of LongRoPE and enables faster convergence. Performance is robust across multiple models and context lengths, and the work shows that non-strictly increasing scaling factors can enhance extrapolation while reducing fine-tuning costs.
Abstract
Large language models (LLMs) based on the Transformer architecture usually have a limited context length because of high training costs. Recent work extends the context window by adjusting the scaling factors of RoPE and then fine-tuning. However, suboptimal initialization of these factors increases fine-tuning costs and reduces performance at the target length. To address these challenges, we propose a novel RoPE-based fine-tuning framework that diverges from conventional scaling-factor search. Specifically, we present a Divide-and-Conquer Incremental Search (DCIS) algorithm that strategically determines better scaling factors. Further fine-tuning with the identified scaling factors effectively extends the context window of LLMs. Empirical results demonstrate that our methodology not only mitigates performance decay at extended target lengths but also allows the model to be fine-tuned on short contexts and still generalize to long contexts, thereby reducing the cost of fine-tuning. The scaling factors obtained through DCIS can even perform effectively without fine-tuning. Further analysis of the search space reveals that DCIS achieves twice the search efficiency of other methods. We also examine the impact of the non-strictly increasing scaling factors used in DCIS and evaluate the general capabilities of LLMs across various context lengths.
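To make the notion of RoPE scaling factors concrete, the following is a minimal sketch of how per-dimension factors might rescale rotation frequencies. It assumes the common convention in which each base frequency is divided by its factor; the function names, the base of 10000, and the toy head dimension of 8 are illustrative assumptions, not taken from the paper. It shows only the quantity being searched over, not the DCIS search procedure itself.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies: theta_i = base^(-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def scaled_rope_angles(position, head_dim, scaling_factors, base=10000.0):
    """Rotation angles at `position`, with each frequency divided by its factor.

    `scaling_factors` holds one (hypothetical) factor per frequency pair.
    A factor of 1.0 leaves that frequency unchanged.
    """
    freqs = rope_frequencies(head_dim, base)
    if len(scaling_factors) != len(freqs):
        raise ValueError("need exactly one scaling factor per frequency pair")
    return [position * f / s for f, s in zip(freqs, scaling_factors)]


# A uniform factor of 4 maps position 16384 onto the angles that the
# unscaled model would produce at position 4096 (linear interpolation).
angles_long = scaled_rope_angles(16384, 8, [4.0, 4.0, 4.0, 4.0])
angles_short = scaled_rope_angles(4096, 8, [1.0, 1.0, 1.0, 1.0])

# Factors need not be uniform, nor strictly increasing across dimensions.
angles_mixed = scaled_rope_angles(16384, 8, [1.0, 2.0, 2.0, 4.0])
```

A search method, under these assumptions, would propose candidate factor lists like the last one and keep those that give lower perplexity on long inputs.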
