Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko
TL;DR
The paper addresses overflow in soft token compression for retrieval-augmented generation, formalizing overflow as a regime in which compressed representations have lost so much task-relevant signal that the query can no longer be answered. It introduces a progression of detection approaches, from query-agnostic saturation statistics to query-aware learned probes on joint representations, and shows that overflow is detectable immediately after compression, without full LLM inference. Saturation statistics clearly separate compressed from uncompressed tokens but do not predict overflow on their own, whereas probes over joint query-context representations detect overflow before LLM processing with an average AUC-ROC of about $0.72$. The work demonstrates the practical potential of low-cost pre-LLM gating and adaptive compression, and provides a general methodology that may extend to other soft compression schemes and tasks.
Abstract
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet the limits of compressibility, and the point at which compression begins to erase task-relevant content, remain underexplored. In this paper, we define \emph{token overflow} as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with an average AUC-ROC of 0.72 across the HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results mark a progression from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
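To make the idea of a query-aware probe concrete, the following is a minimal sketch on synthetic data, not the probe or features used in the paper. It assumes that query and context representations are fixed-size vectors, that overflow labels are available, and that a logistic-regression probe is an acceptable stand-in for a "lightweight probing classifier". The synthetic labelling rule, the element-wise product feature, and all function names (`auc_roc`, `fit_logistic_probe`, `probe_scores`, `evaluate`) are illustrative assumptions.

```python
import numpy as np

def auc_roc(labels, scores):
    """AUC-ROC via the rank-sum formulation (assumes scores have no ties)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def probe_scores(X, w, b):
    """Predicted overflow probability for each row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Synthetic stand-ins for query and compressed-context representations.
rng = np.random.default_rng(0)
n, d = 2000, 8
query = rng.normal(size=(n, d))
context = rng.normal(size=(n, d))

# ASSUMPTION: overflow occurs when the compressed context aligns poorly
# with the query. This rule is invented for illustration only.
alignment = (query * context).sum(axis=1) + 0.5 * rng.normal(size=n)
overflow = (alignment < 0).astype(float)

# Query-agnostic features (context only) vs. joint query-context features.
ctx_feats = context
joint_feats = np.hstack([query, context, query * context])

train, test = slice(0, 1500), slice(1500, n)

def evaluate(feats):
    """Train the probe on the training split and report held-out AUC-ROC."""
    w, b = fit_logistic_probe(feats[train], overflow[train])
    return auc_roc(overflow[test], probe_scores(feats[test], w, b))

auc_ctx = evaluate(ctx_feats)
auc_joint = evaluate(joint_feats)
print(f"context-only AUC-ROC: {auc_ctx:.3f}")
print(f"joint AUC-ROC:        {auc_joint:.3f}")
```

Because the synthetic label depends on the interaction between query and context, the context-only probe has nothing to learn from, while the joint probe can. The resulting numbers reflect only this toy construction and say nothing about the 0.72 AUC-ROC reported above. In a gating setup, the probe score would be thresholded before calling the LLM, for example by falling back to the uncompressed context when the predicted overflow probability is high.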
