Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation

Julia Belikova; Danila Rozhevskii; Dennis Svirin; Konstantin Polev; Alexander Panchenko

TL;DR

The paper addresses overflow in soft token compression for retrieval-augmented generation, formalizing overflow as a regime in which compressed representations lose enough task-relevant signal that queries can no longer be answered. It introduces a progression of detection approaches, from query-agnostic saturation statistics to query-aware learned probes on joint representations, and shows that overflow is detectable immediately after compression, without full LLM inference. Key findings: saturation statistics clearly separate compressed from uncompressed tokens but do not predict overflow on their own, while joint query-context representations support overflow detection prior to LLM processing (average AUC-ROC around 0.72). The work demonstrates the practical potential of low-cost pre-LLM gating and adaptive compression and provides a general methodology that can extend to other soft compression schemes and tasks.

Abstract

Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility, and when compression begins to erase task-relevant content, remain underexplored. In this paper, we define "token overflow" as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
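
To make the "saturation statistics" idea concrete, here is a minimal sketch of two of the statistics named in the paper's outline, Hoyer's sparsity and kurtosis, computed over a single token representation. The function names are hypothetical, and the choices shown (the standard Hoyer formula, excess rather than raw kurtosis, population moments) are assumptions; the paper's exact definitions are not reproduced on this page.

```python
import math


def hoyer_sparsity(x):
    """Hoyer's sparsity: 0 for a uniform-magnitude vector, 1 for a one-hot vector."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if n < 2 or l2 == 0.0:
        raise ValueError("need at least two dimensions and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


def excess_kurtosis(x):
    """Excess kurtosis from population moments (assumed convention; 0 for a Gaussian)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for a constant vector")
    return m4 / (m2 * m2) - 3.0


def saturation_features(token_vec):
    """Query-agnostic feature pair for one (compressed or ordinary) token vector."""
    return [hoyer_sparsity(token_vec), excess_kurtosis(token_vec)]
```

Both functions look only at the token vector itself, which is what makes them query-agnostic: they can characterize how a compressed token differs from an ordinary one, but they carry no information about what a particular query needs from the context.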

Paper Structure (38 sections, 10 equations, 1 figure, 6 tables)

Introduction
Related Work
Long-context modeling and compression
Soft compression in RAG
Motivation for overflow detection
Methodology
Problem Setup
Context Complexity Measures
Token Saturation Statistics
Hoyer's sparsity
Spectral entropy
Kurtosis
Attention Features: Query-conditioned Overflow Signals
Mean attention to xRAG tokens
Attention ratios
...and 23 more sections

Figures (1)

Figure 1: Comparison of classifier architectures (Linear scikit-learn, Linear PyTorch, MLP, MLP with SCL) across datasets and feature combinations. All architectures achieve comparable performance, with differences typically under 1 percentage point, demonstrating that overflow is largely linearly separable in joint representation space.
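
The claim of linear separability in a joint representation space can be illustrated with a small sketch: a logistic-regression probe trained on the concatenation of a query vector and a compressed-context vector, scored with AUC-ROC. Everything here is an assumption for illustration, not the authors' setup: the helper names are hypothetical, the data is synthetic, and the labeling rule is invented so that the label depends on both inputs.

```python
import math
import random


def joint_features(query_vec, context_vec):
    """Concatenate a query vector and a compressed-context vector."""
    return list(query_vec) + list(context_vec)


def _sigmoid(z):
    # Split by sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(weights, bias, x):
    """Predicted overflow probability for one joint feature vector."""
    return _sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_linear_probe(X, y, lr=0.5, epochs=100):
    """Fit logistic regression with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = probe_score(weights, bias, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def auc_roc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def synthetic_demo(n=200, dim=4, seed=0):
    """Train on 75% of toy data and return held-out AUC-ROC."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        q = [rng.gauss(0, 1) for _ in range(dim)]
        c = [rng.gauss(0, 1) for _ in range(dim)]
        X.append(joint_features(q, c))
        # Invented rule: the label depends on the query AND the context together.
        y.append(1 if q[0] > c[0] else 0)
    split = (3 * n) // 4
    weights, bias = train_linear_probe(X[:split], y[:split])
    held_out_scores = [probe_score(weights, bias, x) for x in X[split:]]
    return auc_roc(y[split:], held_out_scores)
```

Calling `synthetic_demo()` returns the held-out AUC-ROC on the toy data. Because the invented label depends on the difference between a query coordinate and a context coordinate, a probe that saw only one of the two vectors could not recover it, which mirrors the paper's point that query information matters. In a pre-LLM gating setup, `probe_score` would be thresholded to decide whether to fall back to uncompressed context.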
