Long Context Alignment with Short Instructions and Synthesized Positions

Wenhao Wu, Yizhong Wang, Yao Fu, Xiang Yue, Dawei Zhu, Sujian Li

TL;DR

This work tackles the challenge of enabling LLMs to handle extremely long contexts without collecting new long-input data or altering model architectures. It introduces SkipAlign, a method that synthesizes long-range dependencies by shifting and skipping positional indices within short instruction-response samples, thereby extending relative distances without longer training data. Across base models with varying context windows, SkipAlign demonstrates strong long-context performance on LongBench and outperforms several baselines. With only 6B parameters it performs comparably to GPT-3.5-Turbo-16K, and it excels in Needle-in-a-Haystack tasks, underscoring the importance of long-range dependencies over mere sample length. The approach is computationally efficient, preserves short-text capabilities with minor trade-offs, and shows that the quality of base models and alignment data significantly shapes long-context gains. The paper also outlines future directions toward incorporating actual long-context annotations and extending pretraining to even larger context lengths (up to 1M tokens).

Abstract

Effectively handling instructions with extremely long context remains a challenge for Large Language Models (LLMs), typically necessitating high-quality long data and substantial computational resources. This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of LLMs during the alignment phase without any additional effort beyond training at the original data length. SkipAlign is developed on the premise that long-range dependencies are fundamental to enhancing an LLM's capacity for long context. Rather than merely expanding the length of input samples, SkipAlign synthesizes long-range dependencies at the level of positional indices. This is achieved by the strategic insertion of skipped positions within instruction-following samples, which utilizes the semantic structure of the data to effectively expand the context. Through extensive experiments on base models with a variety of context window sizes, SkipAlign demonstrates its effectiveness across a spectrum of long-context tasks. Particularly noteworthy is that, with a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
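
The position-skipping idea described in the abstract can be illustrated with a minimal sketch. It assumes a sample has already been split into segments (for example, an instruction and a response) and inserts a random gap in the position indices at each segment boundary, so that relative distances across the boundary grow while the number of tokens stays the same. The function name `skip_position_ids`, the segment split, and the random division of the gap budget are illustrative assumptions, not the paper's exact strategies.

```python
import random


def skip_position_ids(segment_lengths, max_length, rng=None):
    """Return position ids for consecutive segments with skipped gaps.

    Hypothetical helper for illustration only: positions are contiguous
    inside a segment, and a random number of positions is skipped before
    every segment after the first. The largest id stays below max_length.
    """
    rng = rng or random.Random(0)
    total_tokens = sum(segment_lengths)
    if total_tokens > max_length:
        raise ValueError("sample is longer than the target maximum length")

    # Number of positions that can be skipped without exceeding max_length.
    remaining = max_length - total_tokens

    position_ids = []
    cursor = 0
    for index, length in enumerate(segment_lengths):
        if index > 0:
            # Assumption: spend a random share of the remaining budget here.
            gap = rng.randint(0, remaining)
            remaining -= gap
            cursor += gap
        position_ids.extend(range(cursor, cursor + length))
        cursor += length
    return position_ids


# A 3-token instruction followed by a 2-token response, stretched so that
# the position ids may span up to 100K positions.
ids = skip_position_ids([3, 2], max_length=100_000)
print(ids)
```

The token sequence itself is unchanged; only the indices passed to the positional encoding differ, which is why the training cost in this sketch depends on the original sample length rather than on the extended span.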

Paper Structure (36 sections, 4 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: SkipAlign modifies positional indices in instruction-following samples to simulate long-range dependency relations. The provided example shows how SkipAlign takes three distinct samples, each initially positioned within a 4096-token window, and independently applies three separate strategies to stretch their lengths to 100K tokens.
  • Figure 2: The frequency of relative distance in the Tülu V2 dataset. Compared with the original distribution, SkipAlign redistributes a small subset of samples into a longer context.
  • Figure 3: Needle in the Haystack test for Llama-2-7B based models: Llama-2-7B-NTK-50K denotes the straightforward expansion of Llama-2-7B using NTK to accommodate 50K tokens without further tuning. Normal-SFT-NTK-50K represents the adaptation of a standard fine-tuned model for this extended context. PackedSFT-50K indicates the fine-tuning process using samples artificially extended to 50K tokens for training.
  • Figure 4: Average score on LongBench for SkipAlign across various maximum extension lengths $L$ and sub-sampling ratios $p$.