LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation

Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen

TL;DR

This work identifies distribution drift and catastrophic forgetting as the two primary causes of short-text degradation when extending LLM context windows. It introduces LongReD, a restoration-distillation framework that combines long-text training with two distillation objectives: short-text distillation, which matches the hidden states of selected layers to those of the original model, and short-to-long distillation, which aligns output distributions using skipped positional indices. Empirical results on Llama-3-8B and Mistral-7B-v0.3 show that LongReD keeps short-text performance close to that of the original model while maintaining or improving long-context capabilities, outperforming plain continual pre-training and certain continual-learning baselines. The approach demonstrates that compatible distillation objectives and skipped-position techniques can effectively bridge short- and long-text processing in extended-context LLMs, with practical efficiency considerations.

Abstract

Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, and the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation by minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden states of selected layers from the original model on short texts. LongReD additionally introduces short-to-long distillation, which aligns the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines. Our code is available at https://github.com/RUCAIBox/LongReD.
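
As a rough illustration of the short-to-long distillation idea, the sketch below shows one plausible way to build skipped positional indices for a short sequence and to compare two output distributions. It is not the authors' implementation: the function names (skipped_position_ids, kl_divergence), the head/tail split with a single random gap, and the choice of KL divergence as the alignment loss are all assumptions made for this example. The official code is in the repository linked above.

```python
import math
import random


def skipped_position_ids(seq_len, target_len, rng):
    """Return seq_len increasing position ids that fit inside target_len.

    Hypothetical scheme (an assumption, not taken from the paper): the head
    half keeps positions 0..head-1, and the tail half is shifted by one
    random gap so that a short text can cover long-range position indices.
    """
    if not 0 < seq_len <= target_len:
        raise ValueError("need 0 < seq_len <= target_len")
    head = seq_len // 2
    gap = rng.randint(0, target_len - seq_len)  # both bounds are inclusive
    return list(range(head)) + [p + gap for p in range(head, seq_len)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token's output distribution.

    Assumed alignment loss: the teacher would be the original model reading
    the short text with ordinary positions, the student the extended model
    reading the same text with skipped positions.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


rng = random.Random(0)
positions = skipped_position_ids(seq_len=8, target_len=32, rng=rng)
loss = kl_divergence([2.0, 0.5, -1.0], [1.5, 0.7, -0.8])
```

In a real training loop the position ids would be passed to the extended model in place of the default 0..seq_len-1 indices, and the loss would be averaged over all tokens in the batch.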

Paper Structure

This paper contains 31 sections, 20 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: Comparison of the original and long-context models via ABF (Xiong et al., 2024) or PI (Chen et al., 2023) on common short-text benchmarks.
  • Figure 2: Relationship between MMLU performance preservation of long-context models w.r.t. the hidden states similarity.
  • Figure 3: Results of models with different training steps.
  • Figure 4: Overview of our proposed Long Context Pre-training with Restoration Distillation (LongReD). The method consists of three parts, i.e., long-text training, short-text distillation, and short-to-long distillation.
  • Figure 5: Similarity matrices of positional vectors inside and outside the context window.