LLoCO: Learning Long Contexts Offline

Sijun Tan, Xiuyu Li, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E. Gonzalez, Raluca Ada Popa

TL;DR

LLoCO is a novel approach to long-context processing that learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA, substantially reducing the cost of long document question answering.

Abstract

Processing long contexts remains a challenge for large language models (LLMs) due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose LLoCO, a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. Our approach extends the effective context window of a 4k-token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using $30\times$ fewer tokens during inference. LLoCO achieves up to $7.62\times$ speed-up during inference and $11.52\times$ higher throughput during finetuning, substantially reducing the cost of long document question answering. This makes it a promising solution for efficient long-context processing. Our code is publicly available at https://github.com/jeffreysijuntan/lloco.
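To make the KV cache concern in the abstract concrete, the following back-of-the-envelope sketch estimates the cache size for a LLaMA2-7B-style decoder at 128k tokens, and at roughly 30 times fewer tokens. The architecture constants (32 layers, hidden size 4096, fp16 values) are assumptions taken from the publicly documented LLaMA2-7B configuration, not from the abstract, and the estimate ignores model weights, activations, and batching.

```python
# Rough KV cache estimate for a LLaMA2-7B-style decoder (a sketch, not a measurement).
# Assumed values, not stated in the abstract:
N_LAYERS = 32          # transformer layers
HIDDEN_SIZE = 4096     # 32 attention heads * 128 dims per head
BYTES_PER_VALUE = 2    # fp16 storage


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens (batch size 1)."""
    # Factor 2 accounts for storing both a key and a value vector per layer.
    return 2 * N_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE * num_tokens


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30


full_tokens = 128 * 1024              # 128k-token context
compressed_tokens = full_tokens // 30  # about 30x fewer tokens, as in the abstract

full_gib = to_gib(kv_cache_bytes(full_tokens))
compressed_gib = to_gib(kv_cache_bytes(compressed_tokens))

print(f"per token:   {kv_cache_bytes(1)} bytes")
print(f"full:        {full_tokens} tokens -> {full_gib:.2f} GiB")
print(f"compressed:  {compressed_tokens} tokens -> {compressed_gib:.2f} GiB")
```

Under these assumptions the cache grows linearly with the number of tokens kept, so a 30 times shorter sequence needs about 30 times less cache memory.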

Paper Structure (22 sections, 1 equation, 7 figures, 10 tables)

Figures (7)

  • Figure 1: The architecture of a regular LLM (left) vs. LLoCO (right). In regular LLMs, long contexts are appended directly to the prompt. In contrast, LLoCO first processes these contexts through a context encoder. The resulting summary token embeddings, which are significantly shorter than the original context, are then prepended to the LLM’s prompt. LLoCO performs instruction finetuning on the embeddings of targeted document groups using a LoRA module. This aligns the LLM’s embedding space with the summary embeddings while keeping both the LLM and the context encoder unchanged.
  • Figure 2: Impact of compression ratio on LLoCO's performance.
  • Figure 3: Fixed needle retrieval task. The sampled article ("haystack") starts with "Mary, ..., a gentle, fashionable girl...", and a context-relevant needle was curated as "Mary's favorite fashion designer was Coco Chanel when she was a teenager. Who was Mary's favorite fashion designer when she was a teenager?"
  • Figure 4: Random needle retrieval with city-word pairs.
  • Figure 5: End-to-end decoding per-token latency (ms) on A100 and A6000 GPUs. LLaMA2 without compression runs out of VRAM for sequences of 64k and 128k on A100, and for 32k sequences on A6000.
  • ...and 2 more figures
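
The data flow described in the Figure 1 caption (encode the long context into a short sequence of summary embeddings, then prepend them to the prompt) can be sketched roughly as follows. This is a toy illustration only: `toy_context_encoder` is a hypothetical placeholder that mean-pools segments of each chunk, whereas the actual context encoder is a learned model, and the LoRA finetuning step is not shown. All function names and the chunk and summary sizes are assumptions made for this example.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def toy_context_encoder(chunk: List[Vector], num_summary: int) -> List[Vector]:
    """Hypothetical stand-in for a learned context encoder.

    Splits the chunk into contiguous segments and mean-pools each one,
    returning at most num_summary summary vectors.
    """
    seg_len = -(-len(chunk) // num_summary)  # ceiling division
    return [mean_pool(chunk[i:i + seg_len]) for i in range(0, len(chunk), seg_len)]


def compress_context(context: List[Vector], chunk_size: int,
                     num_summary: int) -> List[Vector]:
    """Encode the context chunk by chunk and concatenate the summaries."""
    summaries: List[Vector] = []
    for start in range(0, len(context), chunk_size):
        chunk = context[start:start + chunk_size]
        summaries.extend(toy_context_encoder(chunk, num_summary))
    return summaries


def build_llm_input(context: List[Vector], prompt: List[Vector],
                    chunk_size: int, num_summary: int) -> List[Vector]:
    """Prepend the compressed context to the prompt embeddings."""
    return compress_context(context, chunk_size, num_summary) + prompt


# Toy usage: 12 context embeddings, 3 prompt embeddings, 2-dimensional vectors.
context = [[float(i), 1.0] for i in range(12)]
prompt = [[0.0, 0.0] for _ in range(3)]
llm_input = build_llm_input(context, prompt, chunk_size=6, num_summary=2)
print(len(context) + len(prompt), "->", len(llm_input), "input positions")
```

In this toy setting the LLM would see 4 summary vectors plus the 3 prompt vectors instead of all 15 original positions; the real compression ratio depends on the encoder and its settings.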