UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference

Lang Zhou, Shuxuan Li, Zhuohao Li, Shi Liu, Zhilin Zhao, Wei-Shi Zheng

Abstract

Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.
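The rollback-and-expand loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the learned uncertainty detector is replaced here by a plain logit-based stand-in (one minus the maximum softmax probability), uncertainty accumulation across decoding steps is not modeled, and the names `step`, `base_budget`, `max_budget`, `threshold`, and the doubling schedule are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_uncertainty(logits: Sequence[float]) -> float:
    """Stand-in uncertainty score: 1 - max softmax probability.
    (Assumption: the paper's detector is learned and also uses embeddings.)"""
    return 1.0 - max(softmax(logits))


def decode_adaptive(
    step: Callable[[int, int], Sequence[float]],  # hypothetical: (position, budget) -> logits
    num_tokens: int,
    base_budget: int = 16,
    max_budget: int = 128,
    threshold: float = 0.5,
) -> Tuple[List[int], List[int]]:
    """Greedy decoding that regenerates a token with a larger context
    budget whenever the stand-in uncertainty exceeds the threshold."""
    tokens: List[int] = []
    budgets: List[int] = []
    for pos in range(num_tokens):
        budget = base_budget  # every token starts from the small budget
        while True:
            logits = step(pos, budget)
            confident = logit_uncertainty(logits) <= threshold
            if confident or budget >= max_budget:
                break
            # Roll back this token and retry with an expanded window.
            budget = min(budget * 2, max_budget)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
        budgets.append(budget)
    return tokens, budgets


def toy_step(pos: int, budget: int) -> List[float]:
    """Toy model: position 1 is 'hard' and needs a budget of at least 64."""
    if pos == 1 and budget < 64:
        return [0.0, 0.0, 0.0]  # flat logits: high uncertainty
    return [5.0, 0.0, 0.0]      # peaked logits: low uncertainty


tokens, budgets = decode_adaptive(toy_step, num_tokens=3)
print(tokens, budgets)
```

In this toy run only the hard position pays for the larger window, while the other tokens keep the base budget. That is the intuition behind reducing average context usage, though the real savings depend on the detector and the selection method used.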

Paper Structure (35 sections, 11 equations, 9 figures, 16 tables, 1 algorithm)

Figures (9)

  • Figure 1: Context management techniques: (1) No Select: complete context. (2) Fixed Select: fixed-size context window. (3) Ours: adaptive context window.
  • Figure 2: Main workflow of UT-ACA. (a) The user prompt contains the long-context input information. (b) The system instruction specifies the questions or instructions. (c) The uncertainty detector takes the output logits and semantic embeddings to estimate the generation difficulty metric. (d) The adaptive context window receives the detector signal, expands the context window when needed, and triggers regeneration.
  • Figure 3: Overview of token-generation scenarios. The axes delineate the sufficiency of contextual versus intrinsic knowledge, while the logit plots depict the corresponding LLM outputs under varying conditions.
  • Figure 4: Comparison between the F1 score and our conceptual accuracy score.
  • Figure 5: Latency breakdown of UT-ACA under varying MaxBudget settings (block size $= 16$). LSTM forward time values are annotated above the corresponding bars.
  • ...and 4 more figures