Language Models (Mostly) Know When to Stop Reading
Roy Xie, Junlin Wang, Paul Rosu, Chunyuan Deng, Bolun Sun, Zihao Lin, Bhuwan Dhingra
TL;DR
This work tackles the inefficiency of indiscriminate long-context processing in LLMs by introducing dynamic context cutoff, which uses internal sufficiency signals from selected attention heads to determine when enough information has been read. The approach probes model activations to train a lightweight ensemble of classifiers. The context is then processed iteratively in left-to-right chunks with a maintained KV-cache, and processing stops once the ensemble signals sufficiency. Empirical results across six QA datasets and multiple model families show a 1.33x reduction in tokens processed together with a 3.4% accuracy gain on average, outperforming static compression and RAG baselines, especially as models scale. An emergent finding is that larger models can assess sufficiency themselves when prompted, reducing the need for external signals, while smaller models benefit from explicit sufficiency probes.
Abstract
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a novel method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" (detectable through lightweight classifiers) that predict when critical information has been processed. This reveals a new efficiency paradigm: a model's internal understanding, rather than external compression heuristics, dictates how much of the context it needs to process. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate a 3.4% accuracy improvement while achieving a 1.33x token reduction on average. Our method also outperforms other context efficiency methods at equivalent token reduction rates. Finally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
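
The chunk-by-chunk stopping rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `dynamic_context_cutoff` and `sufficiency_score`, the fixed chunk size, and the 0.5 threshold are all assumptions made for illustration, and the toy keyword scorer merely stands in for the probe over attention-head activations that the abstract describes. A plain token list also stands in for the maintained KV-cache.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_chunks(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive left-to-right chunks."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def dynamic_context_cutoff(
    tokens: Sequence[str],
    chunk_size: int,
    sufficiency_score: Callable[[List[str]], float],  # hypothetical probe stand-in
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[List[str], float]:
    """Read chunks in order and stop once the sufficiency score reaches the threshold.

    Returns the tokens actually read and the token reduction factor
    (total tokens divided by tokens read).
    """
    if not tokens:
        return [], 1.0
    seen: List[str] = []
    for chunk in split_into_chunks(tokens, chunk_size):
        # In a real system this step would extend the KV-cache with the new chunk
        # instead of re-encoding the tokens that were already read.
        seen.extend(chunk)
        if sufficiency_score(seen) >= threshold:
            break
    return seen, len(tokens) / len(seen)


def toy_score(seen: List[str]) -> float:
    """Toy scorer (assumption): the context is sufficient once the answer appears."""
    return 1.0 if "Paris" in seen else 0.0


context = "the capital of France is Paris and then many other irrelevant words follow here".split()
read, reduction = dynamic_context_cutoff(context, chunk_size=4, sufficiency_score=toy_score)
print(len(read), reduction)
```

In this toy run the answer appears in the second chunk, so 8 of the 14 tokens are read. For scale, the reported average reduction of 1.33x corresponds to reading roughly 75% of the tokens, since 1/1.33 is about 0.75.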
