Language Models (Mostly) Know When to Stop Reading

Roy Xie, Junlin Wang, Paul Rosu, Chunyuan Deng, Bolun Sun, Zihao Lin, Bhuwan Dhingra

TL;DR

This work tackles the inefficiency of indiscriminate long-context processing in LLMs by introducing dynamic context cutoff, which uses internal sufficiency signals from selected attention heads to determine when enough information has been read. Lightweight probes trained on model activations are combined into an ensemble classifier; the context is then processed iteratively in left-to-right chunks with a maintained KV-cache, and processing stops once the classifier detects sufficiency. Across six QA datasets and multiple model families, the method yields a 1.33x reduction in tokens with a 3.4% accuracy gain on average, outperforming static compression and RAG baselines, especially as models scale. An emergent finding is that larger models can assess sufficiency themselves via prompting, reducing the need for trained probes, while smaller models benefit from explicit sufficiency probes.

Abstract

Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a novel method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" -- detectable through lightweight classifiers -- that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding naturally dictates processing needs rather than external compression heuristics. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 3.4% accuracy improvement while achieving 1.33x token reduction on average. Furthermore, our method demonstrates superior performance compared to other context efficiency methods at equivalent token reduction rates. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
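
The chunk-by-chunk stopping rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `dynamic_context_cutoff`, `toy_score`, and the `threshold` parameter are hypothetical names, and the toy scorer merely stands in for the classifier that reads attention-head activations.

```python
from typing import Callable, List, Sequence, Tuple


def dynamic_context_cutoff(
    chunks: Sequence[List[str]],
    sufficiency_score: Callable[[List[str]], float],
    threshold: float = 0.5,
) -> Tuple[List[str], int]:
    """Read chunks left to right and stop once the score reaches the threshold.

    Returns the tokens actually read and the number of chunks consumed.
    """
    seen: List[str] = []
    for n_read, chunk in enumerate(chunks, start=1):
        # Stands in for extending the KV-cache with the new chunk.
        seen.extend(chunk)
        if sufficiency_score(seen) >= threshold:
            return seen, n_read
    # No chunk triggered the cutoff, so the full context was read.
    return seen, len(chunks)


def toy_score(tokens: List[str]) -> float:
    # Hypothetical stand-in for the probe classifier: the context counts as
    # "sufficient" as soon as the answer token has been seen.
    return 1.0 if "Paris" in tokens else 0.0


chunks = [
    ["The", "report", "begins"],
    ["capital", "is", "Paris"],
    ["unrelated", "appendix", "text"],
    ["more", "filler", "here"],
]
seen, n_chunks = dynamic_context_cutoff(chunks, toy_score)
total_tokens = sum(len(chunk) for chunk in chunks)
reduction = total_tokens / len(seen)  # token reduction factor for this toy input
```

In this toy input the loop stops after the second of four chunks. For scale, the reported average reduction of 1.33x corresponds to reading roughly 75% of the tokens (1 / 1.33 ≈ 0.75).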

Paper Structure

This paper contains 55 sections, 7 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Our method enables language models to perform early termination by detecting sufficiency signals in key attention heads, reducing the amount of processed content while preserving performance.
  • Figure 2: Our method leverages the model's internal representations to identify when sufficient information has been processed. A lightweight classifier is trained on selected attention heads to detect context sufficiency, leading to token savings while improving task performance.
  • Figure 3: Validation F1 scores for linear probes across all attention heads in LLaMA3.2-1B, sorted row-wise by F1. Darker blue represents higher F1 scores. Some heads show significantly higher performance. More visualizations can be found in \ref{fig:14B_heatmap}.
  • Figure 4: Our method achieves superior efficiency-accuracy trade-offs compared to baselines. RAG degrades with scale, while Lingua2 remains competitive but lags on multihop tasks. Larger models (14B+) exhibit emergent self-awareness on context sufficiency through prompting.
  • Figure 5: Confidence progression across context chunks. The model's prediction confidence increases monotonically with more context.
  • ...and 6 more figures
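
The captions for Figures 2 and 3 describe scoring a probe per attention head by validation F1 and training a classifier on selected heads. One plausible way to do that selection is sketched below. The selection rule (top-k by F1) and the ensemble rule (a plain average of per-head probabilities) are assumptions for illustration; this excerpt does not specify the paper's exact procedure, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 computed from true positives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def top_k_heads(
    val_preds: Dict[Head, List[int]], y_true: List[int], k: int
) -> List[Head]:
    """Rank heads by the validation F1 of their probe and keep the best k."""
    scores = {head: f1_score(y_true, preds) for head, preds in val_preds.items()}
    return sorted(scores, key=lambda head: scores[head], reverse=True)[:k]


def ensemble_probability(probs: Dict[Head, float], selected: List[Head]) -> float:
    """Assumed ensemble rule: average the selected probes' probabilities."""
    return sum(probs[head] for head in selected) / len(selected)


# Toy validation labels (1 = context is sufficient) and per-head probe predictions.
y_true = [1, 0, 1, 1, 0]
val_preds = {
    (0, 0): [1, 0, 1, 1, 0],
    (0, 1): [1, 1, 0, 1, 0],
    (1, 0): [0, 0, 0, 1, 1],
}
selected = top_k_heads(val_preds, y_true, k=2)
p_sufficient = ensemble_probability(
    {(0, 0): 0.75, (0, 1): 0.5, (1, 0): 0.1}, selected
)
```

The averaged probability would then be compared against a threshold to decide whether to stop reading, so a few high-F1 heads drive the decision while weak heads are ignored.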