Compressed Context Memory For Online Language Model Interaction

Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song

TL;DR

The paper tackles online language model inference with an ever-expanding context by introducing Compressed Context Memory (CCM), a framework that compresses the accumulating attention key/value pairs into a compact memory Mem(t) using a dedicated COMP token. It combines a lightweight conditional LoRA adapter with two memory schemes (CCM-concat and CCM-merge) and a parallelized training strategy, enabling efficient end-to-end optimization without full fine-tuning. Results on conversation, personalization, and multi-task learning show that CCM reaches full-context performance with a $5\times$ smaller context memory, and that it outperforms the sliding-window baseline in streaming settings. The approach generalizes well with a unified compression adapter and offers favorable memory/throughput trade-offs, which makes it practical for memory-constrained online deployments of large language models.
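To make the two memory schemes concrete, here is a minimal sketch of how Mem(t) might be updated at each step. The summary above does not give the update rules, so both are assumptions: CCM-concat is taken to append each new compressed state, and CCM-merge is taken to keep a running average with weight 1/t. The names `update_concat`, `update_merge`, and the toy vectors are hypothetical; in the actual model the states would be the keys/values produced at the COMP token.

```python
from typing import List

Vector = List[float]


def update_concat(memory: List[Vector], h_t: Vector) -> List[Vector]:
    """Assumed CCM-concat rule: append h(t), so memory grows by one slot per step."""
    return memory + [list(h_t)]


def update_merge(memory: List[Vector], h_t: Vector, t: int) -> List[Vector]:
    """Assumed CCM-merge rule: a single slot holding the running mean of h(1..t)."""
    if not memory:
        return [list(h_t)]
    a = 1.0 / t  # assumed merge weight; other weightings are possible
    prev = memory[0]
    return [[(1.0 - a) * p + a * h for p, h in zip(prev, h_t)]]


# Toy compressed states h(1), h(2), h(3) (illustrative values only).
states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

mem_concat: List[Vector] = []
mem_merge: List[Vector] = []
for t, h in enumerate(states, start=1):
    mem_concat = update_concat(mem_concat, h)
    mem_merge = update_merge(mem_merge, h, t)

print(len(mem_concat), len(mem_merge))  # prints "3 1"
```

Under these assumptions the concat memory grows linearly with the number of interactions while the merge memory stays at a fixed size, which is the trade-off the two scheme names suggest.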

Abstract

This paper presents a context key/value compression method for Transformer language models in online scenarios, where the context continually expands. As the context lengthens, the attention process demands increasing memory and computation, which in turn reduces the throughput of the language model. To address this challenge, we propose a compressed context memory system that continually compresses the accumulating attention key/value pairs into a compact memory space, enabling language model inference in memory-constrained computing environments. Our compression process involves integrating a lightweight conditional LoRA into the language model's forward pass during inference, without the need for fine-tuning the model's entire set of weights. We achieve efficient training by modeling the recursive compression process as a single parallelized forward computation. Through evaluations on conversation, personalization, and multi-task learning, we demonstrate that our approach achieves the performance level of a full context model with $5\times$ smaller context memory size. We further demonstrate the applicability of our approach in a streaming setting with an unlimited context length, outperforming the sliding window approach. Code is available at https://github.com/snu-mllab/context-memory.
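The abstract mentions a conditional LoRA without spelling out the condition. As a hedged sketch, assume the low-rank update is applied only when the current token is the COMP token, so that ordinary tokens pass through the frozen base weights unchanged. The function `conditional_lora_forward`, the flag `is_comp`, and the toy matrices are hypothetical and written in plain Python for readability; this is not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def conditional_lora_forward(
    x: Vector, is_comp: bool, W: Matrix, A: Matrix, B: Matrix
) -> Vector:
    """Return W x for normal tokens and W x + B (A x) for COMP tokens.

    W is the frozen base weight (d_out x d_in); A (r x d_in) and
    B (d_out x r) are the trainable low-rank factors.
    The COMP-only condition is an assumption of this sketch.
    """
    base = matvec(W, x)
    if not is_comp:
        return base
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # toy frozen weight (identity)
A = [[1.0, 1.0]]              # rank-1 down-projection
B = [[0.5], [0.0]]            # rank-1 up-projection
x = [1.0, 2.0]

normal_out = conditional_lora_forward(x, False, W, A, B)
comp_out = conditional_lora_forward(x, True, W, A, B)
print(normal_out)  # [1.0, 2.0]
print(comp_out)    # [2.5, 2.0]
```

Gating the adapter this way would leave the model's behavior on regular text untouched, which is consistent with the abstract's claim that the full set of weights does not need fine-tuning.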

Paper Structure (51 sections, 5 equations, 10 figures, 30 tables, 1 algorithm)

Figures (10)

  • Figure 1: Main concept of online inference systems. Left: Conventional online inference approach. Right: The proposed system with compressed context memory. The colored boxes represent attention keys/values (or input tokens) required for Transformer inference. The new context refers to the sequence comprising an input and a model output from the preceding interaction.
  • Figure 2: Illustration of the compression process at time step $t$. Each colored box symbolizes attention hidden states.
  • Figure 3: Illustration of the parallelized training process. In (a), each colored box symbolizes attention keys/values of memory, compression tokens, and normal text tokens. In (b), gray indicates that attention is blocked. In the figures, $\langle \text{C}\rangle$ stands for $\langle \texttt{COMP} \rangle$. At each layer, after the parallel updates of compressed context memory, the attention operation occurs with the mask in (b). Note the calculation of $\text{Mem}(t)$ occurs after $c(t)$ and its subsequent $\langle \texttt{COMP} \rangle$ token. Reordering the top row of (b) to align with this temporal relation yields an autoregressive mask.
  • Figure 4: Feed-forward operations of our conditional LoRA.
  • Figure 5: Illustration of the compression and inference processes at time step $t$. The arrow indicates the process of referencing the keys/values on the left to generate the keys/values on the right. Here, $l_c$ means the expected length of key/value pairs of context $c(\cdot)$, and $l_i$ denotes the total length of input and output. We assume that each compression outcome has a length of $1$. Notations at the top of $\text{Mem}(\cdot)$ denote the length of key/value pairs corresponding to CCM-concat/-merge.
  • ...and 5 more figures
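
The length notation in the Figure 5 caption can be turned into a small back-of-the-envelope calculation. The sketch below follows the caption's assumption that each compression outcome has length 1, so Mem(t) holds t key/value pairs under CCM-concat and a single pair under CCM-merge. It further assumes that compression at step t attends to Mem(t-1) and c(t), that inference attends to Mem(t) and the input, and that the COMP token itself is not counted. The function names and the example values of t, l_c, and l_i are hypothetical and are not taken from the paper.

```python
def memory_length(t: int, scheme: str) -> int:
    """Length of Mem(t) when every compression outcome has length 1."""
    if scheme == "concat":
        return t
    if scheme == "merge":
        return min(t, 1)  # empty before the first compression, then one slot
    raise ValueError(f"unknown scheme: {scheme}")


def compression_kv_length(t: int, l_c: int, scheme: str) -> int:
    """Key/value pairs attended to while compressing c(t): Mem(t-1) plus c(t)."""
    return memory_length(t - 1, scheme) + l_c


def inference_kv_length(t: int, l_i: int, scheme: str) -> int:
    """Key/value pairs held during inference at step t: Mem(t) plus input/output."""
    return memory_length(t, scheme) + l_i


def full_context_kv_length(t: int, l_c: int, l_i: int) -> int:
    """Baseline that keeps all t contexts uncompressed."""
    return t * l_c + l_i


# Illustrative values only.
t, l_c, l_i = 16, 50, 20
full_len = full_context_kv_length(t, l_c, l_i)
concat_len = inference_kv_length(t, l_i, "concat")
merge_len = inference_kv_length(t, l_i, "merge")
print(full_len, concat_len, merge_len)  # prints "820 36 21"
```

With these toy numbers the uncompressed baseline keeps 820 key/value pairs at inference time, against 36 for CCM-concat and 21 for CCM-merge, which shows why the memory saving grows with the number of interactions.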