Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression

Yong Zhang, Heng Li, Yanwen Huang, Ning Cheng, Yang Guo, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao

TL;DR

Sentinel addresses the inefficiency of retrieval-augmented generation by decoding how an LLM internally utilizes context, rather than relying on heuristic importance scores. It runs a lightweight decoder-only proxy model under a fixed protocol, extracts sentence-level attention signals at the final decoding token, and scores them with a thin probing classifier trained with weak supervision, which enables query-conditioned sentence selection within a token budget. On LongBench, Sentinel with a 0.5B proxy achieves up to 5x context compression with QA performance comparable to 7B-scale baselines, and it generalizes to Chinese even though the probe is trained only on English data. The results indicate that model-internal utilization signals are sufficiently consistent across proxy model families and sizes to support compression with compact proxies, offering an interpretable, data-efficient, and scalable approach to efficient RAG.
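To make "query-conditioned sentence selection within a token budget" concrete, here is a minimal sketch of one way such a selection step could work. It assumes per-sentence relevance scores are already available from the probe. The greedy policy, the function names (`select_sentences`, `whitespace_token_count`), and the whitespace-based token count are all illustrative assumptions, not details taken from the paper.

```python
from typing import Callable, List, Sequence


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer (an assumption for this sketch)."""
    return len(text.split())


def select_sentences(
    sentences: Sequence[str],
    scores: Sequence[float],
    token_budget: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedily keep the highest-scoring sentences that fit in the budget.

    The kept sentences are returned in their original document order so the
    compressed context still reads coherently.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")

    # Visit sentences from most to least relevant.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    kept: List[int] = []
    used = 0
    for i in ranked:
        cost = count_tokens(sentences[i])
        # Skip sentences that would exceed the budget; smaller ones may still fit.
        if used + cost <= token_budget:
            kept.append(i)
            used += cost

    # Restore the original ordering of the selected sentences.
    return [sentences[i] for i in sorted(kept)]


if __name__ == "__main__":
    context = [
        "Paris is the capital of France.",
        "The weather was nice.",
        "France is in Europe.",
    ]
    relevance = [0.9, 0.1, 0.6]  # hypothetical probe outputs
    print(select_sentences(context, relevance, token_budget=10))
```

Restoring document order after ranking is a design choice in this sketch; a real pipeline would also count tokens with the target model's tokenizer rather than by whitespace.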

Abstract

Retrieval-augmented generation (RAG) often suffers from long and noisy retrieved contexts. Prior context compression methods rely on predefined importance metrics or supervised compression models, rather than on the model's own inference-time behavior. We propose Sentinel, a lightweight sentence-level compression framework that treats context compression as an understanding decoding problem. Sentinel probes native attention behaviors of a frozen LLM with a lightweight readout to decode which parts of the context are actually utilized when answering a query, rather than using attention as a direct relevance score. We empirically observe that decoded relevance signals exhibit sufficient consistency across model scales to support effective compression with compact proxy models. On LongBench, Sentinel with a 0.5B proxy model achieves up to 5x compression while matching the QA performance of 7B-scale baselines, and despite being trained only on English QA data, generalizes effectively to Chinese and out-of-domain settings.
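The probing idea described above can be illustrated with a small, self-contained sketch. It assumes the attention weights from the final decoding token to every context position have already been extracted from the frozen proxy model as a nested list indexed `[layer][head][position]`. The feature definition (attention mass summed over each sentence span, one value per head) and the logistic readout are assumptions chosen for illustration; this excerpt does not specify the paper's exact features or classifier, and the names `sentence_features` and `probe_scores` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Attention from the final token, indexed as [layer][head][position].
Attention = Sequence[Sequence[Sequence[float]]]


def sentence_features(
    final_token_attn: Attention,
    sentence_spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Build one feature vector per sentence.

    Each feature is the attention mass that one (layer, head) pair places on
    the token span [start, end) of that sentence.
    """
    features = []
    for start, end in sentence_spans:
        row = [
            sum(head[start:end])
            for layer in final_token_attn
            for head in layer
        ]
        features.append(row)
    return features


def probe_scores(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Score each sentence with a logistic readout over its features."""
    scores = []
    for row in features:
        logit = sum(w * x for w, x in zip(weights, row)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


if __name__ == "__main__":
    # Toy input: 2 layers, 1 head each, 4 context positions, 2 sentences.
    attn = [[[0.1, 0.2, 0.3, 0.4]], [[0.4, 0.3, 0.2, 0.1]]]
    spans = [(0, 2), (2, 4)]
    feats = sentence_features(attn, spans)
    # Hypothetical probe weights; in practice these would be learned.
    print(probe_scores(feats, weights=[1.0, -1.0], bias=0.0))
```

The readout learns which heads carry utilization signal, which is what distinguishes probing from using raw attention values directly as relevance scores.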

Paper Structure

This paper contains 65 sections, 1 equation, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: Sentinel Framework Overview. Sentinel decodes query-aware context utilization from native attention behaviors of a frozen LLM. By probing sentence-level attention features aggregated at a single decoding step, Sentinel identifies relevant context without training compression models or performing full autoregressive generation.
  • Figure 2: Impact of proxy model family and scale on Sentinel performance under a 2k-token context (LongBench Overall AVG).
  • Figure 3: Compression ratio ablation on Qwen-2.5-7B-Instruct with a 0.5B proxy.
  • Figure 4: Comparison of attention head importance patterns identified by retrieval-head analysis (left) and Sentinel probing (right).