Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang

Abstract

Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space "thinking" chains of thought. A growing line of work pushes extra computation into the model's latent space, which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent rollouts, (ii) residual/activation steering, and (iii) memory (KV) compression. An underexplored alternative is memory consolidation/reconsolidation, two processes in the brain that are responsible for stabilising newly formed memory traces and, upon recall, transiently rendering established traces plastic such that they can integrate new contextual information before restabilising. In Transformer LLMs, this can be seen as analogous to performing in-place rewrites of new KV segments, and rewrites of recalled past segments. In this work, we give a theoretical justification as to why memory (re)consolidation via KV cache rewrites is beneficial for improved reasoning. We do this through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input information compression and retention of predictive information in latent representations. We then introduce the Bottlenecked Transformer, which augments a backbone LLM with a Cache Processor, an auxiliary Transformer that performs periodic, non-causal, in-place KV rewrites at newline-delimited reasoning step boundaries. The Processor consolidates recently written KV entries and reconsolidates a small, top-k attention-selected set of prior entries. We evaluate our Bottlenecked Transformer architecture on math reasoning benchmarks. Our model sees consistent performance gains over vanilla Transformers and pause-token augmented baselines, with gains of up to +6.6pp for selected tasks/backbones.
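
As a rough illustration of the selection step described above, the sketch below picks which cache positions would be handed to the Cache Processor when a newline token closes a reasoning step. It is not the authors' implementation: the function name `select_rewrite_indices`, the use of one aggregated attention score per position, and the choice to treat the recent window as exactly the latest step are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def select_rewrite_indices(
    tokens: Sequence[str],
    attn_scores: Sequence[float],
    k: int,
    newline: str = "\n",
) -> Tuple[List[int], List[int]]:
    """Return (recent, retrieved) cache positions for one Processor call.

    Hypothetical helper: `attn_scores` is assumed to hold one aggregated
    attention score per cached position; the paper's exact scoring rule
    is not reproduced here.
    """
    if len(attn_scores) != len(tokens):
        raise ValueError("need one attention score per token")
    # The Processor is only invoked at a reasoning-step boundary.
    if not tokens or tokens[-1] != newline:
        return [], []

    # Assumption: the recent window is the latest step, i.e. everything
    # after the previous newline up to and including the current one.
    start = 0
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == newline:
            start = i + 1
            break
    recent = list(range(start, len(tokens)))

    # Among positions before the window, keep the k highest-scoring ones.
    ranked = sorted(range(start), key=lambda i: attn_scores[i], reverse=True)
    retrieved = sorted(ranked[:k])
    return recent, retrieved


tokens = ["a", "b", "\n", "c", "d", "\n"]
scores = [0.1, 0.5, 0.2, 0.9, 0.9, 0.3]
recent, retrieved = select_rewrite_indices(tokens, scores, k=2)
```

In this toy call, the two index lists together form the set of KV entries that would be rewritten in place; every other cache entry is left untouched.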

Paper Structure

This paper contains 43 sections, 3 theorems, 23 equations, 5 figures, 4 tables.

Key Result

Lemma 4.1

Let $\mathcal{M}_\theta$ be a model parameterized by $\theta$, with input/output variables $(X, Y)$, and let $\mathcal{Z}^{\mathcal{M}_\theta}$ be the set of information bottlenecks in $\mathcal{M}_\theta$, with $\hat{Z}$ defining the terminal bottleneck in $\mathcal{M}_\theta$. Then $I(X;Z) \geq I(

Figures (5)

  • Figure 1: Bottlenecked Transformer architecture, consisting of a backbone LLM that processes and generates tokens, and a Transformer Cache Processor that rewrites KV entries. The Cache Processor is invoked each time a newline token is generated (marking the end of a reasoning step). When invoked, recent tokens (from the recent step window, RSW, in grey) and $k$ retrieved tokens beyond the RSW (in blue) are passed in parallel to the Cache Processor and rewritten in-place.
  • Figure 2: Conceptual illustration. (A) Bottlenecked Transformers balance input compression $I(X;Z)$ with predictive information $I(Z;Y)$ for high generalisation. (B) This achieves superior predictive efficiency $I(Z;Y)/I(X;Z)$ vs. capacity over other methods.
  • Figure 3: Epoch-matched comparison of SFT@$N$ and Bottleneck@$N$ across seven tasks. The backbone is SFT-trained for 8 epochs with per-epoch checkpoints; Bottleneck@$N$ uses checkpoint $N{-}1$ plus one Processor epoch, and curves plot accuracy versus total epochs $N$. The red $\times$ marks the highest score for each task across both model variants and all $N$.
  • Figure 4: Cache Processor rewrite magnitudes on GSM8K. Left: per-invocation mean distances for top-$k$, recent-step window, and all rewritten tokens. Right: layer–head heatmaps of mean cosine distance between pre- and post-Processor value vectors.
  • Figure 5: Size ablation of the Cache Processor on a frozen Llama 3.2 1B backbone, showing per-epoch performance of each variant on each task.
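
The rewrite magnitude reported in Figure 4 is a mean cosine distance between value vectors before and after the Processor runs. A minimal sketch of that quantity, assuming the vectors are available as plain lists of floats (the function names here are illustrative, not taken from the paper):

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_rewrite_distance(
    pre: Sequence[Sequence[float]],
    post: Sequence[Sequence[float]],
) -> float:
    """Average cosine distance over paired pre-/post-rewrite vectors."""
    if len(pre) != len(post) or not pre:
        raise ValueError("need equally sized, non-empty vector lists")
    return sum(cosine_distance(u, v) for u, v in zip(pre, post)) / len(pre)


# One vector left unchanged, one rotated by 90 degrees.
pre_values = [[1.0, 0.0], [1.0, 0.0]]
post_values = [[1.0, 0.0], [0.0, 1.0]]
mean_dist = mean_rewrite_distance(pre_values, post_values)
```

Computing this average separately per layer and head would give one cell of a heatmap like the one in the right panel.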

Theorems & Definitions (9)

  • Definition 4.1: Neural Information Bottleneck
  • Definition 4.2: Ordering of Bottlenecks in Neural Networks
  • Definition 4.3: Terminal Bottleneck
  • Lemma 4.1
  • Theorem 4.1: KV-Cache and Final Hidden State as Seq-to-Seq Terminal Bottleneck
  • Theorem 4.2: Autoregressive Training Encourages high $I(S_{0:n};\hat{Z})$ and $I(\hat{Z};S_{n+1})$
  • proof
  • proof
  • proof