Latent Context Compilation: Distilling Long Context into Compact Portable Memory

Zeju Li, Yizhou Zhou, Qiang Xu

TL;DR

This work proposes Latent Context Compilation, a framework that fundamentally shifts context processing from adaptation to compilation, distilling long contexts into compact buffer tokens -- stateless, portable memory artifacts that are plug-and-play compatible with frozen base models.

Abstract

Efficient long-context LLM deployment is stalled by a dichotomy between amortized compression, which struggles with out-of-distribution generalization, and Test-Time Training, which incurs prohibitive synthetic data costs and requires modifying model weights, creating stateful parameters that complicate concurrent serving. We propose Latent Context Compilation, a framework that fundamentally shifts context processing from adaptation to compilation. By utilizing a disposable LoRA module as a compiler, we distill long contexts into compact buffer tokens -- stateless, portable memory artifacts that are plug-and-play compatible with frozen base models. Crucially, we introduce a self-aligned optimization strategy that eliminates the need for synthetic context-relevant QA pairs. By regularizing the context reconstruction task with context-agnostic random queries, we force compressed tokens to reside within the model's existing instruction-following manifold. Experiments with Llama-3.1-8B demonstrate that Latent Context Compilation preserves fine-grained details and reasoning capabilities where prior methods falter, effectively decoupling memory density from model parameters even at a 16x compression ratio.
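
To make the compression ratio concrete, here is a small arithmetic sketch. It assumes the ratio is defined as raw context tokens divided by buffer tokens, which the abstract does not state explicitly. The function name `buffer_token_count` and the 8192-token context length are illustrative only and do not come from the paper.

```python
import math


def buffer_token_count(context_tokens: int, compression_ratio: int) -> int:
    """Buffer tokens needed for a context of the given length.

    Assumption: compression_ratio = context tokens / buffer tokens,
    rounded up so that no part of the context is left uncovered.
    """
    if context_tokens <= 0 or compression_ratio <= 0:
        raise ValueError("context_tokens and compression_ratio must be positive")
    return math.ceil(context_tokens / compression_ratio)


if __name__ == "__main__":
    # Hypothetical 8192-token context at several compression ratios.
    for ratio in (2, 4, 8, 16, 32):
        print(f"{ratio}x -> {buffer_token_count(8192, ratio)} buffer tokens")
```

Under this assumption, a 16x ratio turns an 8192-token context into 512 buffer tokens, so the KV cache held for that context shrinks by the same factor.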

Paper Structure (45 sections, 4 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Overview of Latent Context Compilation. During Phase 1, we distill the raw context into compact Buffer Tokens using a disposable LoRA that forces information flow through the buffer. The model minimizes KL divergence against a full-context teacher. During Phase 2, the LoRA module is discarded to ensure portability. The resulting Buffer Tokens are retained as a standard KV cache, allowing the frozen LLM to perform high-fidelity inference on new queries with zero additional parameters.
  • Figure 2: Scaling of Manifold Regularization. Performance trajectory on Fictional Story and CoQA datasets as the quantity of context-agnostic queries ($N_{Q}$) increases.
  • Figure 3: Training Data Ablation on CoQA. Comparison of different Repeat Data quantities (0, 500, 1000, 2000) under varying Regularization strengths ($N_{Q}$).
  • Figure 4: Impact of Compression Ratio on Model Performance. Evaluation on the CoQA dataset with compression ratios ranging from 2$\times$ to 32$\times$.
  • Figure 5: Ablation on Distillation Loss Type. Comparison between KL Divergence and MSE Loss on CoQA and Fictional Story.
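
The distillation objective named in Figures 1 and 5 can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes the loss is computed over next-token logits, that the KL direction is KL(teacher || student), and that the MSE variant is taken over the same logits; none of these details are given in the captions. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_distillation_loss(teacher_logits: Sequence[float],
                         student_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single token position.

    Assumption: the teacher sees the full context and the student sees
    only the buffer tokens; the direction of the KL term is a guess.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mse_distillation_loss(teacher_logits: Sequence[float],
                          student_logits: Sequence[float]) -> float:
    """Mean squared error between raw logits (assumed comparison target)."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_logits, student_logits)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    shifted_student = [11.0, 12.0, 13.0]  # same distribution, shifted logits
    print("KL :", kl_distillation_loss(teacher, shifted_student))
    print("MSE:", mse_distillation_loss(teacher, shifted_student))
```

One property the sketch makes visible: KL compares the predicted distributions, so a constant shift of the student logits leaves it at zero, while MSE on raw logits penalizes that shift. Whether this accounts for the difference reported in Figure 5 is not something the captions establish.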