Extending Context Window of Large Language Models via Semantic Compression

Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, Wei Han

TL;DR

<p>The paper addresses the limited context window of large language models by introducing semantic compression, a source-coding-inspired preprocessing step that shortens long inputs while preserving their semantic content. The input is segmented into topic-based chunks using a graph representation, and each chunk is compressed with a pre-trained summarization model; the module is a plug-in that requires no parameter updates. The method extends the usable context by 6-8 times and can be combined with extrapolation- and interpolation-based techniques for further extension, while maintaining fluency and reducing computational overhead. Experiments on passkey retrieval, question answering, summarization, and few-shot learning show that the approach extends the context window across these long-context tasks.</p>

Abstract

Transformer-based Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses. This constraint restricts their applicability in scenarios involving long texts. We propose a novel semantic compression method that enables generalization to texts that are 6-8 times longer, without incurring significant computational costs or requiring fine-tuning. Our proposed framework draws inspiration from source coding in information theory and employs a pre-trained model to reduce the semantic redundancy of long inputs before passing them to the LLMs for downstream tasks. Experimental results demonstrate that our method effectively extends the context window of LLMs across a range of tasks including question answering, summarization, few-shot learning, and information retrieval. Furthermore, the proposed semantic compression method exhibits consistent fluency in text generation while reducing the associated computational overhead.
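
The pipeline summarized above (segment the input by topic, compress each segment, reassemble in the original order) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The bag-of-words cosine similarity, the adjacent-sentence threshold used here in place of graph-based clustering, and the `first_sentence_summarizer` stand-in for a pre-trained summarization model are all assumptions chosen so the example runs without external dependencies. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bow(sentence: str) -> Counter:
    # Bag-of-words vector (assumption: a real system would use learned embeddings).
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def topic_chunks(sentences: List[str], threshold: float = 0.2) -> List[List[str]]:
    # Simplification: start a new chunk whenever similarity between
    # adjacent sentences drops below the threshold. This replaces the
    # graph-based topic segmentation with a much cruder heuristic.
    chunks: List[List[str]] = []
    for s in sentences:
        if chunks and cosine(bow(chunks[-1][-1]), bow(s)) >= threshold:
            chunks[-1].append(s)
        else:
            chunks.append([s])
    return chunks


def first_sentence_summarizer(chunk: List[str]) -> str:
    # Placeholder for a pre-trained summarizer: keep only the first sentence.
    return chunk[0]


def semantic_compress(
    text: str,
    summarize: Callable[[List[str]], str] = first_sentence_summarizer,
    threshold: float = 0.2,
) -> str:
    # Compress each topic chunk, then reassemble in the original order.
    chunks = topic_chunks(split_sentences(text), threshold)
    return " ".join(summarize(c) for c in chunks)


if __name__ == "__main__":
    doc = (
        "The cat sat on the mat. The cat slept on the mat. "
        "Stock prices rose sharply today. Stock prices fell later today."
    )
    print(semantic_compress(doc))
```

In practice the `summarize` argument would wrap a pre-trained summarization model, and the compressed text would then be passed to the LLM together with the task prompt.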

Paper Structure

This paper contains 31 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: With the inclusion of the semantic compression module, the redundancies in the input are eliminated, thereby effectively extending the context window. The semantic compression is reminiscent of the concept of source coding in information theory.
  • Figure 2: An illustration of our semantic compression method. The input text is first segmented into topic-based chunks using a graph representation. These chunks are then refined with pre-trained models so that key information is preserved. Finally, the refined chunks are assembled in their original order. The resulting semantically compressed text is approximately 6-8 times shorter than the original input and therefore falls within the context window of the LLM. For further length extension, other methods such as extrapolation- and interpolation-based techniques can be applied in combination.
  • Figure 3: Example of a synthetic prompt for the passkey retrieval task (Mohtashami and Jaggi, 2023). The pre-trained LLM cannot process the long input because of its context length constraint. Applying semantic compression removes the redundant information in the long document, while the compressed input retains the essential key information. The LLM can then process the compressed input along with the prompt to generate the correct answer. The distinct colors used in the illustration correspond to topic-based chunks.
  • Figure 4: Perplexity on the GovReport dataset was evaluated at different sequence lengths. The perplexity curves of Llama2 (green) and our method (purple) exhibit similar trends for sequences up to 4k in length. However, as the sequence length exceeds the training length of 4k, our method effectively flattens the perplexity curve, indicating that fluency is preserved for longer sequences.
  • Figure 5: Comparison between model variants on the passkey retrieval task. The retrieval accuracy of the Llama2 baseline (green) drops to zero at about 5k due to out-of-memory issues. Our method (purple) successfully extends the length to 30k. Moreover, when combined with the SoTA extrapolation-based method YaRN, the context length can be further extended to over 60k while the retrieval accuracy remains consistently above 90%.
  • ...and 1 more figures
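
The passkey retrieval setup described in the Figure 3 caption can be sketched as a small prompt generator. The filler sentences, the instruction wording, and the function names below are assumptions modeled on the commonly used form of this task, not text taken from the paper. The point of the sketch is only to show why the task suits semantic compression: almost all of the prompt is redundant filler, and a single short span carries the answer.

```python
import random
from typing import Tuple

# Assumed filler text; highly redundant by design.
FILLER = (
    "The grass is green. The sky is blue. The sun is yellow. "
    "Here we go. There and back again. "
)


def make_passkey_prompt(n_filler: int, seed: int = 0) -> Tuple[str, int]:
    # Hide a random 5-digit passkey at a random position inside the filler.
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    position = rng.randint(0, n_filler)
    info = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    prompt = (
        "There is an important info hidden inside a lot of irrelevant text. "
        "Find it and memorize it. "
        + FILLER * position
        + info
        + FILLER * (n_filler - position)
        + "What is the pass key? The pass key is"
    )
    return prompt, passkey


def is_correct(model_output: str, passkey: int) -> bool:
    # Retrieval counts as correct if the passkey appears in the model output.
    return str(passkey) in model_output


if __name__ == "__main__":
    prompt, key = make_passkey_prompt(n_filler=200, seed=1)
    print(len(prompt.split()), "words; passkey hidden:", key)
```

Increasing `n_filler` lengthens the prompt, which is how retrieval accuracy can be measured as a function of input length.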