LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

Zhiyuan Hu, Yuliang Liu, Jinman Zhao, Suyuchen Wang, Yan Wang, Wei Shen, Qing Gu, Anh Tuan Luu, See-Kiong Ng, Zhiwei Jiang, Bryan Hooi

TL;DR

LongRecipe tackles the inefficiency of extending LLM context windows by combining impactful token analysis with position index transformation and targeted training optimizations. It demonstrates that long-context abilities can be cultivated using only a fraction of the target context window and computational resources, via pretraining data replay and model merging to preserve base capabilities. Across multiple open-source LLMs, the approach yields consistent gains for 80k–128k contexts while maintaining core general abilities, approaching GPT-4–level performance under constrained resources. The work provides a practical pathway to scalable, efficient long-context generalization and releases code to enable replication and extension.

Abstract

Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces the computational resources required for training by over 85% compared to full-sequence training. Furthermore, LongRecipe preserves the original LLM's capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80 GB of memory. Our code is released at https://github.com/zhiyuanhubj/LongRecipe.
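To make the idea of simulating long-sequence inputs with shorter samples more concrete, here is a minimal Python sketch of one way position indices could be transformed. It is an illustration under stated assumptions, not the authors' algorithm: the function name `simulate_long_positions`, the even chunking, and the uniform random gaps are all hypothetical choices, and the abstract does not specify how LongRecipe selects tokens or positions.

```python
import random


def simulate_long_positions(train_len, target_len, num_segments=4, seed=0):
    """Assign position indices in [0, target_len) to a train_len-token sample.

    Hypothetical scheme: split the sample into contiguous chunks, keep indices
    consecutive inside each chunk, and skip a random number of positions
    between chunks so the indices span a range far larger than train_len.
    """
    if not 0 < num_segments <= train_len <= target_len:
        raise ValueError("need 0 < num_segments <= train_len <= target_len")

    rng = random.Random(seed)

    # Chunk lengths, as even as possible.
    base, extra = divmod(train_len, num_segments)
    lengths = [base + (1 if i < extra else 0) for i in range(num_segments)]

    # Unused positions that can be skipped over.
    slack = target_len - train_len
    # Sorted cumulative offsets: chunk i is shifted right by offsets[i].
    offsets = sorted(rng.randint(0, slack) for _ in range(num_segments))

    positions = []
    start = 0
    for length, offset in zip(lengths, offsets):
        positions.extend(range(start + offset, start + offset + length))
        start += length
    return positions


# A 24k-token sample (30% of 80k) whose indices reach into the 80k range.
pos = simulate_long_positions(train_len=24_000, target_len=80_000)
print(len(pos), pos[0], pos[-1])
```

Because the offsets are sorted, the resulting indices are strictly increasing and never exceed `target_len - 1`, so the model sees large relative distances while attention is still computed over only `train_len` tokens.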

Paper Structure (29 sections, 9 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: Method Overview
  • Figure 2: Context window extension experiments using Llama3-8B-I with an 80k-token target length. Each percentage is the fraction of the target length used per training sample: 10% corresponds to 8k tokens, 20% to 16k tokens, 30% to 24k tokens, and 40% to 32k tokens. The 100% configuration uses the entire long sample.
  • Figure 3: Comparison of the average distance among tokens for different methods and context windows.
  • Figure 4: Frequency distribution of parts of speech for tokens with significant logit changes across text positions. INTJ (Interjection), SYM (Symbol), ADP (Adposition), AUX (Auxiliary), PRON (Pronoun), CCONJ (Coordinating Conjunction), NUM (Numeral).
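
The percentages in the Figure 2 caption map to per-sample token counts by simple proportion. The short Python sketch below reproduces that arithmetic; it assumes "80k" means 80,000 tokens, and the helper name `tokens_per_sample` is introduced here only for illustration.

```python
TARGET_TOKENS = 80_000  # assumption: "80k" is read as 80,000 tokens


def tokens_per_sample(percent, target_tokens=TARGET_TOKENS):
    """Tokens used per training sample at a given percentage of the target length."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return target_tokens * percent // 100


for percent in (10, 20, 30, 40, 100):
    print(f"{percent:>3}% -> {tokens_per_sample(percent) // 1000}k tokens")
# Prints:
#  10% -> 8k tokens
#  20% -> 16k tokens
#  30% -> 24k tokens
#  40% -> 32k tokens
# 100% -> 80k tokens
```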