LoCoCo: Dropping In Convolutions for Long Context Compression

Ruisi Cai; Yuandong Tian; Zhangyang Wang; Beidi Chen

Abstract

This paper tackles the memory hurdle of processing long context sequences in Large Language Models (LLMs) by presenting a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo). LoCoCo employs only a fixed-size Key-Value (KV) cache and can enhance efficiency in both the inference and fine-tuning stages. Diverging from prior methods that selectively drop KV pairs based on heuristics, LoCoCo leverages a data-driven adaptive fusion technique, blending previous KV pairs with incoming tokens to minimize the loss of contextual information and ensure accurate attention modeling. This token integration is achieved by injecting one-dimensional convolutional kernels that dynamically calculate mixing weights for each KV cache slot. Designed for broad compatibility with existing LLM frameworks, LoCoCo allows for straightforward "drop-in" integration without architectural modifications, while incurring minimal tuning overhead. Experiments demonstrate that LoCoCo maintains consistently strong performance across various context lengths and can achieve a high context compression rate during both inference and fine-tuning. During inference, we compressed up to 3482 tokens into a KV cache of size 128 while retaining performance comparable to the full sequence, an accuracy improvement of up to 0.2791 over baselines at the same cache size. During post-training tuning, we also extended the context length from 4K to 32K using a KV cache of fixed size 512, achieving performance similar to fine-tuning with entire sequences.
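The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`conv1d_logits`, `compress_kv`), the kernel shape `(m, k, d)`, the choice to convolve over the keys, and the softmax normalization across the sequence are all assumptions made for illustration. The sketch only shows the general idea of a one-dimensional convolution producing one set of mixing weights per cache slot, so that `n` KV pairs are blended into a fixed number `m` of slots.

```python
import numpy as np


def conv1d_logits(x, kernel):
    """Hypothetical helper: 'same'-padded 1D convolution along the sequence.

    x:      (n, d) token features (here, the keys).
    kernel: (m, k, d) one kernel of width k per cache slot (assumed shape).
    Returns (m, n) logits: one score per (cache slot, token) pair.
    """
    n, _ = x.shape
    _, k, _ = kernel.shape
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    xp = np.pad(x, ((pad_left, pad_right), (0, 0)))
    # Sliding windows over the sequence: shape (n, k, d).
    windows = np.stack([xp[i:i + k] for i in range(n)])
    return np.einsum("nkd,mkd->mn", windows, kernel)


def compress_kv(keys, values, kernel):
    """Blend n KV pairs into m cache slots (illustrative assumption).

    Each slot's mixing weights are a softmax over the sequence, so every
    compressed key/value is a convex combination of the original ones.
    """
    logits = conv1d_logits(keys, kernel)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ keys, weights @ values, weights


# Usage: compress 16 KV pairs of dimension 8 into 4 cache slots.
rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 8))
values = rng.normal(size=(16, 8))
kernel = rng.normal(size=(4, 3, 8))
k_small, v_small, w = compress_kv(keys, values, kernel)
print(k_small.shape, v_small.shape, w.shape)
```

In this sketch, a row of mixing weights that is one-hot would simply copy a single token into the slot, which is how a weighted merge of this kind can subsume keeping or dropping individual tokens.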

Paper Structure (30 sections, 5 equations, 3 figures, 7 tables, 2 algorithms)

Introduction
Related Work
Long-Context Inference
Long-Context Fine-tuning
Attention Approximation
Language Model Design with Built-In Convolutions
Methodology
Segment-Level Attention with Long Sequences
Convolution as a Context Compression Operator
Convolutional Token Compressor
Complexity Analysis
Connection with Token Dropping
Dropping-In Integration of LoCoCo
Long-Context Efficient Inference
Long-Context Extension
...and 15 more sections

Figures (3)

Figure 1: Overview of our pipeline. We process long sequences block by block and maintain a fixed-size compressed memory.
Figure 2: Token merging via convolutional kernels as a "drop-in" integration without modifying the original weights. Based on Llama-2-7B [touvron2023llama], we insert convolutional heads on top of self-attention and test model performance on various few-shot downstream tasks. The input sequence typically consists of about 2000 tokens. We compare our method with the token eviction strategy of [zhang2023h]. We also report the uncompressed case, where the model uses the full sequence.
Figure 3: Varying memory sizes during fine-tuning, evaluated on Proof-Pile-2 [azerbayev2023llemma]. Compared to [zhang2023h], our method shows exceptional performance at large compression ratios, indicating the expressiveness of the merged tokens.
