Long Context Alignment with Short Instructions and Synthesized Positions

Wenhao Wu; Yizhong Wang; Yao Fu; Xiang Yue; Dawei Zhu; Sujian Li

TL;DR

This work tackles the challenge of enabling LLMs to handle extremely long contexts without collecting new long-input data or altering model architectures. It introduces SkipAlign, a method that synthesizes long-range dependencies by shifting and skipping positional indices within short instruction-response samples, thereby extending relative distances without longer training data. Across base models with varying context windows, SkipAlign shows strong long-context performance on LongBench and outperforms several baselines; with only 6B parameters it is comparable to GPT-3.5-Turbo-16K, and it also performs well on the Needle-in-a-Haystack test, underscoring the importance of long-range dependencies over mere sample length. The approach is computationally efficient, preserves short-text capabilities with minor trade-offs, and shows that the quality of the base model and the alignment data significantly shapes long-context gains. The paper also outlines future directions: incorporating actual long-context annotations and extending pretraining to even larger context lengths (up to 1M tokens).

Abstract

Effectively handling instructions with extremely long context remains a challenge for Large Language Models (LLMs), typically necessitating high-quality long data and substantial computational resources. This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of LLMs in the alignment phase without requiring any effort beyond training at the original data length. SkipAlign is developed on the premise that long-range dependencies are fundamental to enhancing an LLM's capacity for long context. Rather than merely expanding the length of input samples, SkipAlign synthesizes long-range dependencies through position indices. This is achieved by strategically inserting skipped positions within instruction-following samples, which uses the semantic structure of the data to effectively expand the context. Through extensive experiments on base models with a variety of context window sizes, SkipAlign demonstrates its effectiveness across a spectrum of long-context tasks. Particularly noteworthy is that, with a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance and is comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
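
The idea of inserting skipped positions can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `skipped_position_ids`, the representation of a sample as a list of segment lengths (for example, one instruction and one response), and the uniform random choice of gap sizes are all assumptions made for illustration. The sketch only shows how a short sample can keep its token count while its position indices span a much longer range.

```python
import random


def skipped_position_ids(segment_lengths, max_length, seed=0):
    """Assign position ids to consecutive segments, inserting a random gap
    before each segment (hypothetical helper, not from the paper)."""
    rng = random.Random(seed)
    total = sum(segment_lengths)
    if total > max_length:
        raise ValueError("sample is longer than the target context length")

    budget = max_length - total  # how many positions may be skipped in total
    position_ids = []
    next_pos = 0
    for length in segment_lengths:
        skip = rng.randint(0, budget)  # assumed: uniform gap size
        budget -= skip
        next_pos += skip
        # Positions inside a segment stay consecutive.
        position_ids.extend(range(next_pos, next_pos + length))
        next_pos += length
    return position_ids


# A 7-token sample (3-token instruction, 4-token response) placed
# somewhere inside a 100K-position range.
ids = skipped_position_ids([3, 4], max_length=100_000)
print(ids)
```

Under these assumptions the number of training tokens is unchanged; only the relative distances between segments grow, which is the property the abstract attributes to SkipAlign.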

Paper Structure (36 sections, 4 equations, 4 figures, 2 tables)

Introduction
Related Work
Long Context Scaling
Long Context Evaluation
Skip Position Training
Methodology
Preliminary
Instruction Tuning
Packed-SFT
Position Indices
SkipAlign
Skipping Positions via Shifting
Skipping Strategy
Frequency of Relative Distances
Experimental Setup
...and 21 more sections

Figures (4)

Figure 1: SkipAlign modifies positional indices in instruction-following samples to simulate long-range dependency relations. The example shows how SkipAlign takes three distinct samples, each initially positioned within a 4096-token window, and independently applies three separate strategies to stretch their lengths to 100K tokens.
Figure 2: The frequency of relative distances in the Tülu V2 dataset. Compared with the original distribution, SkipAlign redistributes a small subset of samples into a longer context.
Figure 3: Needle in the Haystack test for Llama-2-7B based models: Llama-2-7B-NTK-50K denotes the straightforward expansion of Llama-2-7B using NTK to accommodate 50K tokens without further tuning. Normal-SFT-NTK-50K represents the adaptation of a standard fine-tuned model for this extended context. PackedSFT-50K indicates the fine-tuning process using samples artificially extended to 50K tokens for training.
Figure 4: Average score on LongBench for SkipAlign across various maximum extension lengths $L$ and sub-sampling ratios $p$.
