Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations
Maurizio Diaz
TL;DR
The paper investigates how learned Cartridges structure a compact, trainable KV cache to enable long-context inference. It finds that Cartridge keys act as stable, shareable routers across tasks, while value vectors absorb most of the compression burden, a division of labor consistent with mechanistic theories of prefix-tuning. A simple, more diverse initialization, Sampled Chunk Initialization (SCI), can speed convergence, suggesting practical improvements for scaling Cartridge training. Across model families and corpora, key vectors remain largely task-agnostic, whereas value vectors adapt to corpus-specific structure, enabling efficient, high-recall compression that scales with context length. These insights lay groundwork for broader optimization and serving strategies in prefix-tuning-based long-context inference.
Abstract
A bottleneck for long-context LLM inference is the KV cache, which grows linearly with context length. Recent work has proposed Cartridges, an approach that leverages offline compute to train a much smaller KV cache than is typically required for a full document (up to 40x less memory usage at inference time). In this paper, we present the first mechanistic exploration of the learned Cartridge key-value cache structure. In particular, we propose that (1) Cartridge keys act as stable, shareable retrieval routers for the compressed corpora and (2) most of the learned compression occurs within the Cartridge value vectors. We present empirical evidence for this routing theory across tasks, model families, and model sizes; for example, we can ablate the learned Cartridge key vectors between tasks with little performance loss. Finally, we propose a slight improvement in initialization called Sampled Chunk Initialization (SCI), and we suggest that SCI can lead to faster Cartridge convergence than previously demonstrated in the literature. Our findings lay the groundwork for a broader empirical study of Cartridge training optimization, which may be crucial for further scaling.
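As a rough illustration of the routing claim, the toy sketch below builds a hybrid cache that takes its keys from one Cartridge and its values from another, then runs single-head scaled dot-product attention over it. It is not the paper's implementation: the names `attend` and `swap_keys`, the dictionary layout, and the single-head, pure-Python setup are all assumptions made for illustration, and the numbers are invented. It shows only the mechanical point that attention weights (the "routing") are determined by the query and keys alone, while the values determine what content is returned.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Single-head scaled dot-product attention over a small KV cache.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(dim_v)]
    return weights, out

def swap_keys(cartridge_a, cartridge_b):
    # Hypothetical helper: keys from cartridge_a, values from cartridge_b.
    if len(cartridge_a["keys"]) != len(cartridge_b["values"]):
        raise ValueError("caches must have the same number of slots")
    return {"keys": cartridge_a["keys"], "values": cartridge_b["values"]}

# Toy caches with two slots each (made-up numbers, not trained Cartridges).
task_a = {"keys": [[1.0, 0.0], [0.0, 1.0]],
          "values": [[1.0, 2.0], [3.0, 4.0]]}
task_b = {"keys": [[0.5, 0.5], [0.2, -0.3]],
          "values": [[10.0, 0.0], [0.0, 10.0]]}

hybrid = swap_keys(task_a, task_b)
query = [2.0, 0.0]
weights_a, _ = attend(query, task_a["keys"], task_a["values"])
weights_h, out_h = attend(query, hybrid["keys"], hybrid["values"])
```

In this sketch `weights_h` is identical to `weights_a`, because both caches share the same keys, while `out_h` is a mixture of task B's values. Whether trained keys transfer between tasks with little performance loss is an empirical question that this toy example does not address.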
