KaVa: Latent Reasoning via Compressed KV-Cache Distillation

Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi

TL;DR

KaVa tackles the high cost of chain-of-thought reasoning by learning latent internal reasoning through distillation from a compressed teacher KV-cache. The framework combines a three-way pipeline with redundancy-aware KV eviction and a KV matching loss, augmented by CODI self-distillation, so that the latent student can mimic the teacher's internal dynamics without generating verbose traces. Empirically, KaVa outperforms strong latent baselines on natural-language reasoning tasks, shows smaller degradation when shifting from equation-like to natural-language traces, and scales to larger backbones while maintaining efficiency. This work establishes compressed KV-cache distillation as a scalable supervision signal that merges the accuracy of CoT-trained teachers with the deployability of latent inference.

Abstract

Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work, we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.

Paper Structure

This paper contains 34 sections, 7 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: We propose KaVa, a latent reasoning model with a KV-cache distillation loss. (a) Overview of our proposed compressed KV-cache distilled latent reasoning framework. (b) The teacher builds a full KV-cache from a ground-truth CoT trace; a compression module produces a compressed cache that matches the length of the latent trace. (c) A latent-reasoning student generates continuous thoughts $z_t$ and is trained to match the compressed teacher KV at each layer/step via KV distillation.
  • Figure 2: Graphical model of the latent reasoning generative model. The question prompt is used to generate continuous latent thought $Z$. The answer tokens are generated from the question and latent reasoning trace.
  • Figure 3: During training, the student predicts the answer using latent tokens, the teacher has access to the full reasoning trace, and KV matching distills the information from the full CoT into the latent CoT.
  • Figure 7: Cosine similarity of Keys in the latent CoT with Keys of the ground truth, averaged across heads and layers. We use the same prompt and ground truth CoT as in Table \ref{tab:decoding}.
  • Figure 8: Cosine similarity of Values in the latent CoT with Values of the ground truth, averaged across heads and layers. We use the same prompt and ground truth CoT as in Table \ref{tab:decoding}.
  • ...and 4 more figures
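
The pipeline in the Figure 1 caption (full teacher cache, compression to the latent length, per-layer KV matching) can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the function names are hypothetical, attention heads are ignored, positions are scored by key L2 norm as a stand-in for the paper's redundancy-aware eviction (which is not specified in this section), and the matching loss is taken to be a mean squared error. It is not the authors' implementation.

```python
import math

def compress_cache(keys, values, budget):
    """Keep `budget` positions of one layer's cache, preserving token order.

    Placeholder scoring (assumption): positions are ranked by key L2 norm.
    The paper's redundancy-aware eviction rule is NOT reproduced here.
    """
    scores = [math.sqrt(sum(x * x for x in k)) for k in keys]
    top = sorted(range(len(keys)), key=lambda i: -scores[i])[:budget]
    keep = sorted(top)  # restore original token order
    return [keys[i] for i in keep], [values[i] for i in keep]

def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    total = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    count = sum(len(u) for u in a)
    return total / count

def kv_matching_loss(teacher_cache, student_cache):
    """Average key/value MSE over layers (assumed loss form).

    Each cache is a list with one (keys, values) pair per layer. The teacher
    cache is compressed to the student's latent length before comparison,
    so no token-level alignment between the two traces is needed.
    """
    loss = 0.0
    for (t_k, t_v), (s_k, s_v) in zip(teacher_cache, student_cache):
        c_k, c_v = compress_cache(t_k, t_v, budget=len(s_k))
        loss += 0.5 * (mse(c_k, s_k) + mse(c_v, s_v))
    return loss / len(teacher_cache)

# One layer, three teacher CoT tokens, two student latent tokens.
teacher = [([[3.0, 0.0], [0.1, 0.0], [0.0, 2.0]],
            [[1.0, 1.0], [9.0, 9.0], [2.0, 2.0]])]
student = [([[3.0, 0.0], [0.0, 2.0]],
            [[1.0, 1.0], [2.0, 4.0]])]
print(kv_matching_loss(teacher, student))
```

In a real training loop the teacher side would be treated as a fixed target (no gradient), and the loss would be computed on tensors per head rather than on Python lists.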