LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Zhiyuan Hu; Yuliang Liu; Jinman Zhao; Suyuchen Wang; Yan Wang; Wei Shen; Qing Gu; Anh Tuan Luu; See-Kiong Ng; Zhiwei Jiang; Bryan Hooi

TL;DR

LongRecipe tackles the inefficiency of extending LLM context windows by combining impactful token analysis with position index transformation and targeted training optimizations. It demonstrates that long-context abilities can be cultivated using only a fraction of the target context window and computational resources, via pretraining data replay and model merging to preserve base capabilities. Across multiple open-source LLMs, the approach yields consistent gains for 80k–128k contexts while maintaining core general abilities, approaching GPT-4–level performance under constrained resources. The work provides a practical pathway to scalable, efficient long-context generalization and releases code to enable replication and extension.

Abstract

Large language models (LLMs) face significant challenges in handling long-context tasks because the effective context window used during pretraining is limited, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window of LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs that combines impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and that it reduces the computational resources needed for training by over 85% compared to full-sequence training. Furthermore, LongRecipe preserves the original LLM's capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training on a single GPU with 80 GB of memory. Our code is released at https://github.com/zhiyuanhubj/LongRecipe.
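
To make the idea of simulating long-sequence inputs with shorter samples more concrete, the following is a minimal sketch of generic position-index skipping. It is not the LongRecipe algorithm itself, whose details are not given in this abstract. The function name `simulate_long_positions`, the chunking scheme, and the `num_chunks` and `seed` parameters are all assumptions made for illustration. The sketch assigns strictly increasing position ids drawn from a long target window to a much shorter training sample, here 24k tokens spread over an 80k window.

```python
import random


def simulate_long_positions(train_len, target_len, num_chunks=4, seed=0):
    """Hypothetical helper: map train_len tokens onto positions in [0, target_len).

    Tokens are split into contiguous chunks; each chunk is shifted by a
    non-decreasing random skip, so position ids stay strictly increasing
    while covering distances far larger than train_len.
    """
    if not 0 < train_len <= target_len:
        raise ValueError("need 0 < train_len <= target_len")
    if not 0 < num_chunks <= train_len:
        raise ValueError("need 0 < num_chunks <= train_len")

    rng = random.Random(seed)

    # Near-equal chunk sizes that sum to train_len.
    base, extra = divmod(train_len, num_chunks)
    chunk_sizes = [base + (1 if i < extra else 0) for i in range(num_chunks)]

    # Cumulative skip before each chunk, bounded by the unused positions.
    slack = target_len - train_len
    skips = sorted(rng.randint(0, slack) for _ in range(num_chunks))

    positions = []
    start = 0  # index of the chunk's first token in the short sample
    for size, skip in zip(chunk_sizes, skips):
        positions.extend(range(start + skip, start + skip + size))
        start += size
    return positions


# Assumed example values: 30% of an 80k target window is 24k tokens.
positions = simulate_long_positions(24_000, 80_000, num_chunks=8)
print(len(positions), positions[0], positions[-1])
```

Under these assumptions, the model only processes 24k tokens per sample, yet the position ids it sees can reach anywhere in the 80k window, which is the general intuition behind training on a fraction of the target length.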

Paper Structure (29 sections, 9 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
Preliminary
Related Works
Methodology
Impactful Token Analysis
Position Index Transformation
Training Optimization Strategies
Experimental Setup
Baselines
Dataset and Evaluation
Setup
Experimental Performance
Long Context Generalization
Maintaining General Abilities
Ablation Study
...and 14 more sections

Figures (4)

Figure 1: Method Overview
Figure 2: Context window extension experiments using Llama3-8B-I with an 80k token length. Each percentage gives the number of tokens per sample relative to that length: 10% corresponds to 8k tokens, 20% to 16k tokens, 30% to 24k tokens, and 40% to 32k tokens. The 100% configuration uses the entire long sample.
Figure 3: Comparison of the average distance among tokens for different methods and context windows.
Figure 4: Frequency Distribution of Parts of Speech for Tokens with Significant Logits Changes Across Text Positions. INTJ (Interjection), SYM (Symbol), ADP (Adposition), AUX (Auxiliary), PRON (Pronoun), CCONJ (Conjunction), NUM (Numeral)
