Long Context Alignment with Short Instructions and Synthesized Positions
Wenhao Wu, Yizhong Wang, Yao Fu, Xiang Yue, Dawei Zhu, Sujian Li
TL;DR
This work tackles the challenge of enabling LLMs to handle extremely long contexts without collecting new long-input data or altering model architectures. It introduces SkipAlign, a method that synthesizes long-range dependencies by strategically shifting and skipping positional indices within short instruction-response samples, thereby extending relative distances without requiring longer training data. Across base models with varying context windows, SkipAlign delivers strong long-context performance on LongBench and outperforms several baselines: at the 6B-parameter scale it is comparable to GPT-3.5-Turbo-16K, and it excels at Needle-in-a-Haystack tasks. These results underscore that long-range dependencies matter more than mere sample length. The approach is computationally efficient, preserves short-text capabilities with minor trade-offs, and shows that the quality of the base model and the alignment data significantly shapes long-context gains. The paper also outlines future directions: incorporating actual long-context annotations and extending pretraining to even larger context lengths (up to 1M tokens).
Abstract
Effectively handling instructions with extremely long context remains a challenge for Large Language Models (LLMs), typically necessitating high-quality long data and substantial computational resources. This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of LLMs during the alignment phase without any additional effort beyond training at the original data length. SkipAlign is built on the premise that long-range dependencies are fundamental to enhancing an LLM's long-context capacity. Rather than merely expanding the length of input samples, SkipAlign synthesizes long-range dependencies at the level of position indices. This is achieved by strategically inserting skipped positions within instruction-following samples, which exploits the semantic structure of the data to effectively expand the context. Through extensive experiments on base models with a variety of context window sizes, SkipAlign demonstrates its effectiveness across a spectrum of long-context tasks. Particularly noteworthy is that, with a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, which is comparable to strong baselines such as GPT-3.5-Turbo-16K on LongBench.
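To make the idea of skipped positions concrete, the following is a minimal sketch, not the authors' implementation. It assumes a sample has already been split into semantic blocks (for example, an instruction and a response) and that a random gap is inserted before each block after the first. The names `skipped_position_ids`, `block_lengths`, and `target_window`, as well as the uniform sampling of gap sizes, are hypothetical choices made for illustration only.

```python
import random
from typing import List, Optional


def skipped_position_ids(
    block_lengths: List[int],
    target_window: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Assign position ids to a short sample, skipping indices between blocks.

    block_lengths: token count of each semantic block (hypothetical input).
    target_window: the longer context length to simulate (hypothetical input).
    Gap sizes are sampled uniformly; this is an assumption, not the paper's rule.
    """
    rng = rng or random.Random()
    total_tokens = sum(block_lengths)
    if total_tokens > target_window:
        raise ValueError("sample is longer than the target window")

    # Number of positions that can be skipped without exceeding the window.
    remaining_budget = target_window - total_tokens

    position_ids: List[int] = []
    next_position = 0
    for block_index, length in enumerate(block_lengths):
        if block_index > 0:
            # Skip a random number of positions before this block.
            skip = rng.randint(0, remaining_budget)
            remaining_budget -= skip
            next_position += skip
        # Tokens inside a block keep consecutive positions.
        position_ids.extend(range(next_position, next_position + length))
        next_position += length
    return position_ids


# Example: a 3-token instruction and a 4-token response, simulating a 100-position window.
ids = skipped_position_ids([3, 4], target_window=100, rng=random.Random(0))
```

The sample still contains only seven tokens, but the relative distance between the instruction and the response can now be far larger than seven, which is the effect the abstract attributes to skipped positions. In practice such ids would be passed to a model that accepts explicit position indices.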
