Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression

Haowen Hou; Fei Ma; Binwen Bai; Xinxin Zhu; Fei Yu

TL;DR

This work tackles inefficiencies in retrieval-augmented LLMs by introducing Instruction-Aware Contextual Compression (IACC), which filters retrieved content to keep the information most relevant to the instruction. IACC comprises ranking-based and generation-based modules within an encoder-decoder architecture and is trained with two objectives, a ranking loss and a language modeling loss. Empirical results show that IACC can cut context-related costs by about 50 percent, speed up inference by 2.2x, and reduce inference memory usage by about 5 percent, with only a small Rouge-1 decline of 0.047. Generation-based IACC generally outperforms ranking-based methods, and ensembling the two helps further. The approach offers a practical path to more efficient RAG systems and is supported by a public WikiQA-LongForm dataset and open-source code.

Abstract

Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate hallucinations, LLMs are often paired with a retrieval-augmented pipeline that supplies rich external knowledge and context. Challenges nevertheless arise because the context returned by the retriever is often inaccurate and coarse-grained. Supplying irrelevant context to LLMs can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby accelerating and enhancing the use of LLMs. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and generation latency while maintaining performance comparable to that achieved with the full context. Specifically, we achieved a 50% reduction in context-related costs, resulting in a 5% reduction in inference memory usage and a 2.2-fold increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These findings suggest that our method strikes an effective balance between efficiency and performance.
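As a rough illustration of the ranking-style filtering described above, the following minimal sketch scores each context sentence against the instruction and keeps the top half. It is not the paper's compressor: the token-overlap scorer stands in for the trained encoder-decoder model, the names `tokenize`, `relevance`, and `compress_context` are hypothetical, and `keep_ratio=0.5` is chosen only to mirror the roughly 50% context reduction reported.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(instruction, sentence):
    """Hypothetical scorer: share of instruction tokens that also occur in the sentence.

    Assumption: the actual method uses a learned ranking model, not token overlap.
    """
    query = set(tokenize(instruction))
    if not query:
        return 0.0
    return len(query & set(tokenize(sentence))) / len(query)


def compress_context(instruction, sentences, keep_ratio=0.5):
    """Keep the highest-scoring share of sentences, preserving their original order."""
    if not sentences:
        return []
    n_keep = max(1, round(len(sentences) * keep_ratio))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: relevance(instruction, sentences[i]),
        reverse=True,
    )
    kept = sorted(ranked[:n_keep])  # restore document order
    return [sentences[i] for i in kept]


context = [
    "The Eiffel Tower is located in Paris.",
    "Bananas are rich in potassium.",
    "The tower was completed in 1889.",
    "Many tourists enjoy French pastries.",
]
compressed = compress_context("When was the Eiffel Tower completed?", context)
print(compressed)
```

Here the two sentences that mention the tower are kept and the two unrelated ones are dropped, so the prompt passed to the LLM is about half as long.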

Paper Structure (25 sections, 2 equations, 5 figures, 5 tables)

This paper contains 25 sections, 2 equations, 5 figures, 5 tables.

Introduction
Related Work
Retrieval-Augmented Generation
Long Context Large Language Models
Prompt Engineering
Context Compression
Method
Model Architecture
Training Objectives
Instruction-Aware Contextual Compression by Ranking
Instruction-Aware Contextual Compression by Generation
Ensemble the two methods
Experiments
Datasets
Ranking datasets
...and 10 more sections

Figures (5)

Figure 1: Retrieval-Augmented Generation (RAG) pipeline with Instruction-Aware Contextual Compression.
Figure 2: A visualisation of Instruction-Aware Contextual Compression. Deeper color indicates a stronger relevance to the instruction.
Figure 3: Model architecture of the Instruction-Aware Contextual Compressor. We jointly optimize two objectives that push the model to extract the contextual representation most relevant to the instruction.
Figure 4: Performance of Instruction-Aware Contextual Compression compared to the Selective Context baseline.
Figure 5: The impact of generation steps on context compression effectiveness.
