Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations

Maurizio Diaz

TL;DR

The paper investigates how learned Cartridges construct a compact, trainable KV cache to enable long-context inference. It finds that Cartridge keys act as stable, shareable routers across tasks, while value vectors absorb most of the compression burden, a division of labor consistent with mechanistic theories of prefix-tuning. A simple, more diverse initialization—Sampled Chunk Initialization (SCI)—significantly speeds convergence, suggesting practical improvements for scaling Cartridge training. Across model families and corpora, key vectors remain largely task-agnostic, whereas value vectors adapt to corpus-specific structure, enabling efficient, high-recall compression that scales with context length. These insights lay groundwork for broader optimization and serving strategies in prefix-tuning-based long-context inference.

Abstract

A bottleneck for long-context LLM inference is the linearly growing KV cache. Recent work has proposed Cartridges, an approach that leverages offline compute to train a much smaller KV cache than is typically required for a full document (up to 40x less memory usage at inference time). In this paper, we present the first mechanistic exploration of the learned Cartridge key-value cache structure. In particular, we propose that (1) Cartridge keys act as stable, shareable retrieval routers for the compressed corpora and (2) most of the learned compression occurs within the Cartridge value vectors. We present empirical evidence for our routing theory across tasks, model families, and model sizes; for example, we can swap the learned Cartridge key vectors between tasks with little performance loss. Finally, we propose a slight improvement in initialization called Sampled Chunk Initialization (SCI), which we suggest can lead to faster Cartridge convergence than previously demonstrated in the literature. Our findings lay the groundwork for broader empirical study of Cartridge training optimization, which may be crucial for further scaling.
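The key exchange between tasks mentioned in the abstract can be pictured with a small sketch. This is an illustration only, not the paper's code: the `Cartridge` container and the functions `swap_keys` and `layerwise_key_similarity` are hypothetical names, and the sketch assumes a Cartridge is stored as per-layer lists of key and value vectors (attention heads flattened) and that both Cartridges have the same shape.

```python
import math
from dataclasses import dataclass

Vector = list[float]
Layer = list[Vector]  # p trainable vectors per layer (heads flattened; an assumption)


@dataclass(frozen=True)
class Cartridge:
    """Hypothetical container for a trained KV cache: one Layer per model layer."""
    keys: list[Layer]
    values: list[Layer]


def swap_keys(target: Cartridge, donor: Cartridge) -> Cartridge:
    """Keep the target task's values but route with the donor task's keys."""
    if [len(layer) for layer in target.keys] != [len(layer) for layer in donor.keys]:
        raise ValueError("Cartridges must have the same number of layers and vectors")
    return Cartridge(keys=donor.keys, values=target.values)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def layerwise_key_similarity(a: Cartridge, b: Cartridge) -> list[float]:
    """Mean cosine similarity between position-aligned key vectors, per layer."""
    return [
        sum(cosine(u, v) for u, v in zip(layer_a, layer_b)) / len(layer_a)
        for layer_a, layer_b in zip(a.keys, b.keys)
    ]
```

Under the routing hypothesis, `layerwise_key_similarity` would be high for two Cartridges trained on different corpora, and the output of `swap_keys` would retain most of the target task's performance.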

Paper Structure

This paper contains 25 sections, 4 equations, 9 figures, 1 table, 2 algorithms.

Figures (9)

  • Figure 1: (Left) Cartridges learn to compress long-context documents by first generating synthetic conversations about the corpus and then training a small KV cache using the synthetic traces. This process is called Self-Study. Leveraging a context distillation objective, we back-propagate Self-Study traces into trainable KV caches while keeping the rest of the model frozen. (Right) Here we plot the layer-wise mean singular value spectra of a Llama 3.1 8B Cartridge's KV vectors before and after training [llama3]. The resulting key vectors are stable while the learned value vectors increase in singular value, representing a more efficient use of representation space due to compression.
  • Figure 2: (Left) We reproduced a LongHealth Cartridge from the original paper. To do so, we trained a Llama 3.2 3B Cartridge with length $p=2048$ for 3072 optimizer updates (batch_size=64, sequence_length=1024). We checkpointed the Cartridge every 96 optimizer steps and, for all layers $l \in [1, L]$, plotted the key and value vector rotation between consecutive checkpoints. Notably, value rotations are often a full order of magnitude larger than key rotations, and they continue late into the training process. (Right) We trained two Cartridges on separate tasks from the original paper: GenConvo and LongHealth. Afterwards, we plotted the layer-wise cosine similarity between the two fully trained Cartridges and found that their learned key vectors are highly similar. We explore this further in Figure 3, where we show that we can swap these key vectors with minimal downstream performance loss. The learned value vectors, on the other hand, differ most in the layers that experience the most vector rotation during training.
  • Figure 3: (Left) We present three LongHealth evaluation settings: a baseline with no Cartridge (red), a model with a LongHealth-trained Cartridge (blue), and a model where we swap its LongHealth-trained Cartridge key vectors with keys from a different task (orange). We call the latter an Ablation Cartridge. For the Llama models, we swap in key vectors trained on GenConvo, and for Qwen3 we use key vectors trained on arXiv data [qwen3]. While the vector swap leads to a slight performance loss, the Ablation Cartridge still outperforms both a random choice baseline and the model's baseline performance. (Right) We reran our KV cache singular value analysis on Qwen3. First, we noticed that Qwen3 exhibits the same training-time increase in value vector singular values as the Llama family. Second, Qwen3's key vector singular values have higher variance than Llama's, which might explain the larger performance loss during ablation.
  • Figure 4: (Left) We plot the perplexity for 10 Llama-1B GenConvo training runs with both Sampled Chunk Initialization (SCI) and the original paper's First $k$ Tokens Initialization. Additionally, we include a random vector initialization run for comparison. For all our SCI experiments, we chose chunksize=64 because it was the midpoint in our $n$-gram diversity vs. context length analysis (see the appendix; Figure 5). (Right) As a different view of the perplexity graph, we visualize convergence speed over our runs as a box plot. Defining convergence as reaching a target threshold of $\text{perplexity}=1.10$, we run a paired $t$-test to confirm that SCI converges significantly faster ($p<0.05$) than the original paper's First $k$ Tokens Initialization scheme.
  • Figure 5: Here we plot the $n$-gram diversity of Sampled Chunk Initialization (SCI) vs. the First $k$ Tokens Initialization baseline. We note that $2^6$ is approximately the midway point when trading diversity for chunk length, so we chose 64 as the chunksize when running our experiments in Figure 4.
  • ...and 4 more figures
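
The contrast between the two initialization schemes in Figures 4 and 5 can be sketched roughly as follows. The captions do not spell out the exact sampling procedure or the diversity metric, so this is a guess rather than the paper's algorithm: it assumes SCI draws fixed-size contiguous chunks at uniformly random offsets (with replacement) until $p$ tokens are collected, and it measures diversity as the fraction of distinct $n$-grams. The function names `first_k_init`, `sampled_chunk_init`, and `distinct_ngram_ratio` are hypothetical.

```python
import random
from typing import Sequence


def first_k_init(tokens: Sequence[int], p: int) -> list[int]:
    """Baseline: take the first p tokens of the corpus."""
    return list(tokens[:p])


def sampled_chunk_init(
    tokens: Sequence[int], p: int, chunk_size: int = 64, seed: int = 0
) -> list[int]:
    """Assumed SCI: concatenate p // chunk_size contiguous chunks drawn at
    uniformly random start offsets (with replacement)."""
    if p % chunk_size != 0:
        raise ValueError("p must be a multiple of chunk_size")
    if len(tokens) < chunk_size:
        raise ValueError("corpus is shorter than one chunk")
    rng = random.Random(seed)
    selected: list[int] = []
    for _ in range(p // chunk_size):
        start = rng.randrange(len(tokens) - chunk_size + 1)
        selected.extend(tokens[start:start + chunk_size])
    return selected


def distinct_ngram_ratio(tokens: Sequence[int], n: int) -> float:
    """Fraction of n-grams that are distinct (1.0 means no repeated n-gram)."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    grams = {tuple(tokens[i:i + n]) for i in range(total)}
    return len(grams) / total
```

Comparing `distinct_ngram_ratio(sampled_chunk_init(corpus, p, c), n)` across chunk sizes `c` against the `first_k_init` baseline gives a diversity vs. chunk-length trade-off of the kind Figure 5 describes. The selected token ids would then be run through the frozen model to produce the initial KV cache.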