LLoCO: Learning Long Contexts Offline

Sijun Tan; Xiuyu Li; Shishir Patil; Ziyang Wu; Tianjun Zhang; Kurt Keutzer; Joseph E. Gonzalez; Raluca Ada Popa

TL;DR

LLoCO is a novel approach to long-context processing that learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA, substantially reducing the cost of long document question answering.

Abstract

Processing long contexts remains a challenge for large language models (LLMs) due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose LLoCO, a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. Our approach extends the effective context window of a 4k-token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using $30\times$ fewer tokens during inference. LLoCO achieves up to $7.62\times$ speed-up during inference and $11.52\times$ higher throughput during finetuning, substantially reducing the cost of long document question answering. This makes it a promising solution for efficient long context processing. Our code is publicly available at https://github.com/jeffreysijuntan/lloco.
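To make the token savings concrete, the following is a minimal arithmetic sketch of how a $30\times$ reduction changes the number of tokens the LLM sees at inference. The helper name `compressed_length`, the choice of rounding up, and reading "128k" as 128,000 tokens are assumptions made for illustration; they are not taken from the LLoCO codebase.

```python
import math


def compressed_length(context_tokens: int, compression_ratio: float) -> int:
    """Tokens remaining after compressing a context by `compression_ratio`.

    Hypothetical helper for illustration only; rounds up so that a
    partially filled chunk still counts as one token.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return math.ceil(context_tokens / compression_ratio)


# The 30x ratio is the figure quoted in the abstract; the context sizes
# below are example values chosen for this sketch.
for n_tokens in (4_000, 32_000, 128_000):
    print(n_tokens, "->", compressed_length(n_tokens, 30))
```

Under these assumptions, a 128k-token document shrinks to roughly 4.3k summary tokens, which is on the order of the base model's 4k-token window.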

Paper Structure (22 sections, 1 equation, 7 figures, 10 tables)

This paper contains 22 sections, 1 equation, 7 figures, 10 tables.

Introduction
Related work
Method
Architecture Overview
Pipeline for Offline Context Learning
Experiments
Long Document QA
Ablation Study
Evaluation on LongBench
Needle In A Haystack
Inference Latency
Conclusion
Limitations
Extended Experimental Settings
More Details on Datasets
...and 7 more sections

Figures (7)

Figure 1: The architecture of a regular LLM (left) vs. LLoCO (right). In regular LLMs, long contexts are appended directly to the prompt. In contrast, LLoCO first processes these contexts through a context encoder. The resulting summary token embeddings, which are significantly shorter than the original context, are then prepended to the LLM’s prompt. LLoCO performs instruction finetuning on the embeddings of targeted document groups using a LoRA module. This aligns the LLM’s embedding space with the summary embeddings while keeping both the LLM and context encoder unchanged.
Figure 2: Impact of compression ratio on LLoCO's performance.
Figure 3: Fixed needle retrieval task. The sampled article ("haystack") starts with " Mary, ..., a gentle, fashionable girl...", and a context-relevant needle was curated as " Mary's favorite fashion designer was Coco Chanel when she was a teenager. Who was Mary's favorite fashion designer when she was a teenager?"
Figure 4: Random needle retrieval with city-word pairs.
Figure 5: End-to-end decoding per-token latency (ms) on A100 and A6000 GPUs. LLaMA2 without compression runs out of VRAM for sequences of 64k and 128k on A100, and for 32k sequences on A6000.
...and 2 more figures
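The data flow described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: random arrays stand in for the context encoder output and the prompt embeddings, a single frozen weight matrix stands in for the LLM, and all names and sizes (`summary_emb`, `lora_forward`, `d_model`, `rank`, `alpha`) are hypothetical. The low-rank update follows the standard LoRA form, in which the frozen weight `W` is augmented by `(alpha / rank) * B @ A` and `B` starts at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_summary, n_prompt, rank = 16, 4, 6, 2  # toy sizes, assumed
alpha = 4.0                                       # LoRA scaling, assumed

# Stand-ins for the context encoder output and the embedded user prompt.
summary_emb = rng.normal(size=(n_summary, d_model))
prompt_emb = rng.normal(size=(n_prompt, d_model))

# Summary token embeddings are prepended to the prompt embeddings.
llm_input = np.concatenate([summary_emb, prompt_emb], axis=0)

# One frozen projection standing in for the LLM; only A and B would train.
W = rng.normal(size=(d_model, d_model))
A = rng.normal(scale=0.01, size=(rank, d_model))
B = np.zeros((d_model, rank))  # zero init: the adapter starts as a no-op


def lora_forward(x, W, A, B, alpha, rank):
    """Frozen projection plus the scaled low-rank LoRA correction."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


out = lora_forward(llm_input, W, A, B, alpha, rank)
print(llm_input.shape, out.shape)
```

Because `B` is initialized to zero, the adapted layer initially reproduces the frozen layer exactly; finetuning then updates only `A` and `B`, which is consistent with the caption's statement that the LLM and context encoder stay unchanged.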
