Latent Context Compilation: Distilling Long Context into Compact Portable Memory

Zeju Li; Yizhou Zhou; Qiang Xu

TL;DR

This work proposes Latent Context Compilation, a framework that shifts context processing from adaptation to compilation and distills long contexts into compact buffer tokens -- stateless, portable memory artifacts that are plug-and-play compatible with frozen base models.

Abstract

Efficient long-context LLM deployment is stalled by a dichotomy between amortized compression, which struggles with out-of-distribution generalization, and Test-Time Training, which incurs prohibitive synthetic data costs and requires modifying model weights, creating stateful parameters that complicate concurrent serving. We propose Latent Context Compilation, a framework that fundamentally shifts context processing from adaptation to compilation. By utilizing a disposable LoRA module as a compiler, we distill long contexts into compact buffer tokens -- stateless, portable memory artifacts that are plug-and-play compatible with frozen base models. Crucially, we introduce a self-aligned optimization strategy that eliminates the need for synthetic context-relevant QA pairs. By regularizing the context reconstruction task with context-agnostic random queries, we force the compressed tokens to reside within the model's existing instruction-following manifold. Experiments with Llama-3.1-8B demonstrate that Latent Context Compilation preserves fine-grained details and reasoning capabilities where prior methods falter, effectively decoupling memory density from model parameters even at a 16x compression ratio.
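To make the 16x figure concrete, the sketch below estimates the KV-cache footprint of a raw context versus its compiled buffer. It is a back-of-the-envelope illustration, not the authors' accounting: it assumes buffer tokens are stored as ordinary per-layer KV entries, that the compression ratio is context tokens divided by buffer tokens (rounded up), that the cache is held in bf16, and it uses a hypothetical 32,768-token context. The layer, KV-head, and head-dimension values are those of the public Llama-3.1-8B configuration.

```python
import math

# Llama-3.1-8B attention geometry (public model config).
NUM_LAYERS = 32
NUM_KV_HEADS = 8      # grouped-query attention: 8 key/value heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # assumption: bf16 KV cache


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2 covers keys and values.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def buffer_tokens(context_tokens: int, ratio: int) -> int:
    """Buffer size under the assumed definition ratio = context / buffer."""
    return math.ceil(context_tokens / ratio)


def kv_cache_bytes(num_tokens: int) -> int:
    """Total KV-cache bytes for a sequence of num_tokens tokens."""
    return num_tokens * kv_bytes_per_token()


if __name__ == "__main__":
    context_len = 32_768  # hypothetical context length, for illustration only
    ratio = 16
    n_buf = buffer_tokens(context_len, ratio)
    full_mib = kv_cache_bytes(context_len) / 2**20
    buf_mib = kv_cache_bytes(n_buf) / 2**20
    print(f"buffer tokens: {n_buf}")
    print(f"full-context KV cache: {full_mib:.0f} MiB")
    print(f"compiled buffer KV cache: {buf_mib:.0f} MiB")
```

Under these assumptions each token costs 128 KiB of cache, so the hypothetical context shrinks from 4 GiB to 256 MiB. Because the base model stays frozen, that buffer is the only per-context state a server would need to keep.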

Paper Structure (45 sections, 4 equations, 5 figures, 3 tables)

Introduction
Related Work
General Context Compression
Test-Time Adaptation
Test-Time Training
Method
Theoretical Formulation
Compressive Bottleneck Architecture
Disposable LoRA as a Compression Catalyst
Optimization Strategy
Inference Strategy
Experiments
Experimental Setup
Data Construction & Hyperparameters.
Model & Training Implementation.
...and 30 more sections

Figures (5)

Figure 1: Overview of Latent Context Compilation. During Phase 1, we distill the raw context into compact Buffer Tokens using a disposable LoRA that forces information flow through the buffer. The model minimizes KL divergence against a full-context teacher. During Phase 2, the LoRA module is discarded to ensure portability. The resulting Buffer Tokens are retained as a standard KV cache, allowing the frozen LLM to perform high-fidelity inference on new queries with zero additional parameters.
Figure 2: Scaling of Manifold Regularization. Performance trajectory on Fictional Story and CoQA datasets as the quantity of context-agnostic queries ($N_{Q}$) increases.
Figure 3: Training Data Ablation on CoQA. Comparison of different Repeat Data quantities (0, 500, 1000, 2000) under varying Regularization strengths ($N_{Q}$).
Figure 4: Impact of Compression Ratio on Model Performance. Evaluation on the CoQA dataset with compression ratios ranging from 2$\times$ to 32$\times$.
Figure 5: Ablation on Distillation Loss Type. Comparison between KL Divergence and MSE Loss on CoQA and Fictional Story.
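The captions for Figures 1 and 5 refer to distilling against a full-context teacher with either KL divergence or MSE. The following is a minimal sketch of how the two objectives differ on a single next-token position, not the paper's implementation. It assumes the losses are computed on output logits, uses the forward direction KL(teacher || student), and omits temperature and token averaging; none of these details are stated in this summary. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) between next-token distributions (assumed direction)."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def mse_distill_loss(teacher_logits, student_logits):
    """Mean squared error taken directly on the raw logits (assumed target)."""
    n = len(teacher_logits)
    return sum((a - b) ** 2 for a, b in zip(teacher_logits, student_logits)) / n


# Toy logits over a 3-token vocabulary; values are illustrative only.
teacher = [2.0, 0.5, -1.0]            # full-context teacher
shifted = [x + 1.0 for x in teacher]  # same distribution, logits offset by a constant
uniform = [0.0, 0.0, 0.0]             # student that has lost the context

if __name__ == "__main__":
    print("KL  (shifted):", kl_distill_loss(teacher, shifted))
    print("MSE (shifted):", mse_distill_loss(teacher, shifted))
    print("KL  (uniform):", kl_distill_loss(teacher, uniform))
```

A constant offset in the student's logits leaves its predicted distribution unchanged, so the KL term is essentially zero, while logit-space MSE still penalizes it. This is one plausible reason the two losses can behave differently, though the summary does not say which effect drives the result in Figure 5.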
