Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention

Huanxuan Liao, Wen Hu, Yao Xu, Shizhu He, Jun Zhao, Kang Liu

TL;DR

HyCo$_2$ introduces a hybrid context compression framework that balances local detail preservation and global semantic retention to tackle long-context inference in LLMs. It combines soft global refinement via a hybrid adapter (merging an MLP and a QFormer) with a hard local token-retention classifier, and it trains these components through a three-stage alternating process (paraphrase pretraining, completion pretraining, and instruction tuning with KL distillation). The approach yields substantial improvements in long-text reasoning, achieves high token efficiency (up to 88.8% reduction), and, in many settings, matches or surpasses uncompressed performance while using far fewer parameters. This work offers a practical, scalable path for deploying long-context-capable LLMs with reduced compute and memory demands, enabling more efficient retrieval-augmented and multi-document reasoning systems.

Abstract

Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing, driving interest in context compression techniques. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. However, the uneven distribution of textual content relevance and the diversity of demands for user instructions mean these approaches frequently lead to the loss of potentially valuable information. To address this, we propose $\textbf{Hy}$brid $\textbf{Co}$ntext $\textbf{Co}$mpression (HyCo$_2$) for LLMs, which integrates both global and local perspectives to guide context compression while retaining both the essential semantics and critical details for task completion. Specifically, we employ a hybrid adapter to refine global semantics with the global view, based on the observation that different adapters excel at different tasks. Then we incorporate a classification layer that assigns a retention probability to each context token based on the local view, determining whether it should be retained or discarded. To foster a balanced integration of global and local compression, we introduce auxiliary paraphrasing and completion pretraining before instruction tuning. This promotes a synergistic integration that emphasizes instruction-relevant information while preserving essential local details, ultimately balancing local and global information retention in context compression. Experiments show that our HyCo$_2$ method significantly enhances long-text reasoning while reducing token usage. It improves the performance of various LLM series by an average of 13.1% across seven knowledge-intensive QA benchmarks. Moreover, HyCo$_2$ matches the performance of uncompressed methods while reducing token consumption by 88.8%.
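
The local step described in the abstract, a classification layer that assigns each context token a retention probability and then keeps or discards it, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear-plus-sigmoid scorer, the 0.5 threshold, the toy two-dimensional features, and the names `retention_probs`, `hard_select`, and `token_reduction` are all assumptions made for illustration. The last function only shows the arithmetic behind a token-reduction figure such as 88.8%.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def retention_probs(token_feats, weights, bias):
    """Hypothetical classifier: linear layer + sigmoid, one probability per token."""
    return [
        sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
        for feats in token_feats
    ]


def hard_select(tokens, probs, threshold=0.5):
    """Keep tokens whose retention probability reaches the (assumed) threshold."""
    return [t for t, p in zip(tokens, probs) if p >= threshold]


def token_reduction(original_len, compressed_len):
    """Fraction of tokens removed by compression."""
    return 1.0 - compressed_len / original_len


# Toy features (assumed): first dimension ~ "content word", second ~ "function word".
tokens = ["The", "capital", "of", "France", "is", "Paris"]
feats = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.1, 0.9], [1.0, 0.1]]

probs = retention_probs(feats, weights=[2.0, -1.0], bias=-0.5)
kept = hard_select(tokens, probs)
print(kept)

# Keeping 112 of 1000 tokens corresponds to an 88.8% reduction.
print(round(token_reduction(1000, 112), 3))
```

In this toy setting the content-bearing tokens score above the threshold and the function words are dropped. In the method described above, the retained tokens would be complemented by the globally refined representation from the hybrid adapter, which this sketch does not model.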

Paper Structure

This paper contains 29 sections, 8 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Different paradigms for processing long-text inputs: (a) original input, (b) hard compression, (c) soft compression, and (d) our hybrid compression. We categorize representative methods under each paradigm and evaluate them on three criteria: local details (whether the method retains important local details), global semantics (whether it facilitates understanding of the overall context), and inference cost (whether it reduces memory usage and inference latency).
  • Figure 2: (a) Hybrid Context Compression Framework. We employ a classification layer for local token selection and use a hybrid adapter to extract an instruction-relevant representation. Additionally, a router refines the global context through soft integration, thereby optimizing the overall context representation. (b) Alternating Training Method. (1) Refining the hybrid adapter with paraphrase pretraining, (2) optimizing the classification layer with completion pretraining, and (3) instruction tuning for both the hybrid adapter and the classification layer.
  • Figure 3: Significance of Soft MoE. The reported values are the ratio of each baseline's performance to that of the best variant, Gate.
  • Figure 4: We employ Mistral-7B to investigate two aspects: (a) a four-dimensional comparison of information preservation between HyCo$_2$ and xRAG following context compression and reconstruction, and (b) the performance trends of various compression methods as context length increases. BERTScore measures semantic similarity, Information Loss measures the entropy value of discarded information, while Readability and ROUGE-L evaluate the quality of the reconstructed context.
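
The captions for Figures 2 and 3 mention a router that softly integrates the outputs of the hybrid adapter. A common way to realize such soft integration is a softmax-weighted mixture of expert outputs. The sketch below shows that general pattern only. Whether HyCo$_2$ uses exactly this form is an assumption, and `soft_mix`, the two-expert setup, and the example vectors are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mix(expert_outputs, router_logits):
    """Weighted sum of expert output vectors, with weights from a softmax gate."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


# Stand-ins for the two adapter branches (assumed values, not real model outputs).
mlp_out = [1.0, 0.0]
qformer_out = [0.0, 1.0]

# Equal logits give equal weights, so the mixture is the element-wise average.
mixed = soft_mix([mlp_out, qformer_out], router_logits=[0.0, 0.0])
print(mixed)
```

Unlike a hard top-1 router, this gate keeps every branch active and differentiable, which is the usual motivation for soft mixtures.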