
Stacked from One: Multi-Scale Self-Injection for Context Window Extension

Wei Han, Pan Zhou, Shuicheng Yan

TL;DR

Across a comprehensive suite of long-context modeling and understanding benchmarks, the proposed SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy.

Abstract

The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups ($2\times$ over streaming and $3\times$ over encoder-decoder architectures).
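
The tree-based, query-aware compression described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`build_query_aware_tree`, `downsample`), the use of strided subsampling as a stand-in for the lower model's learned compression, the dot-product scoring against a mean vector, and the halving of the stride per level are all assumptions made for illustration. The sketch only shows the general idea of repeatedly splitting a chunk, following the half most relevant to the query, and keeping the other half at a coarser granularity.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# (tree level, start index, end index, retained vectors)
Node = Tuple[int, int, int, List[List[float]]]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def downsample(vectors: Sequence[Vector], stride: int) -> List[List[float]]:
    """Keep every `stride`-th vector (assumed stand-in for learned compression)."""
    return [list(v) for v in vectors[::stride]]


def build_query_aware_tree(
    chunk: Sequence[Vector],
    query: Vector,
    depth: int = 3,
    base_stride: int = 8,
) -> List[Node]:
    """Split a chunk `depth - 1` times, descending into the half closest to the query.

    Unselected halves are kept at the stride of their level; the stride halves
    at each deeper level, so spans nearer the query are kept at finer granularity.
    """
    if depth < 1 or base_stride < 1:
        raise ValueError("depth and base_stride must be positive")
    if len(chunk) < 2 ** (depth - 1):
        raise ValueError("chunk is too short for the requested depth")

    nodes: List[Node] = []
    start, end = 0, len(chunk)
    stride = base_stride
    for level in range(1, depth):
        stride = max(1, base_stride >> (level - 1))
        mid = (start + end) // 2
        score_left = dot(mean_vector(chunk[start:mid]), query)
        score_right = dot(mean_vector(chunk[mid:end]), query)
        if score_left >= score_right:
            selected, other = (start, mid), (mid, end)
        else:
            selected, other = (mid, end), (start, mid)
        # The less relevant half stops here, at this level's granularity.
        nodes.append((level, other[0], other[1],
                      downsample(chunk[other[0]:other[1]], stride)))
        start, end = selected
    # The last selected span is kept at the finest stride reached.
    nodes.append((depth - 1, start, end, downsample(chunk[start:end], stride)))
    nodes.sort(key=lambda node: node[1])  # restore left-to-right order
    return nodes
```

For example, calling `build_query_aware_tree(chunk, query, depth=3, base_stride=4)` on a 16-vector chunk returns three spans: one half kept at stride 4 and two quarter-length spans kept at stride 2, so fewer vectors are retained than in the original chunk while the query-relevant region keeps more detail.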

Paper Structure

This paper contains 42 sections, 8 equations, 6 figures, 13 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of SharedLLM. The architecture resembles a general encoder-decoder architecture such as T5 (Raffel et al., 2020), but the lower and upper models interact only at the first $M$ layers, through shared key-values that are encoded and compressed from each text chunk into a sequence of trees (top-left).
  • Figure 2: A running example of our tree (depth=3). Each box indexed by $i$ represents the $i$-th iteration of node splitting and selection.
  • Figure 3: Comparison of memory usage (left) and total inference time on 100 examples (right) between SharedLLM and other training-time baseline methods. The data were collected from a small experiment on 100 examples at each corresponding length. "OOM" denotes an out-of-memory exception raised at test time.
  • Figure 4: Ablation studies on different configurations of structural information injection. The best values in each category and the settings consistent with our defaults are highlighted in bold.
  • Figure 5: Accuracy comparison on passkey retrieval (single key-value pair) task.
  • ...and 1 more figure