Two are better than one: Context window extension with multi-grained self-injection

Wei Han, Pan Zhou, Soujanya Poria, Shuicheng Yan

TL;DR

SharedLLM is a novel approach to context window extension, grounded in multi-grained context compression and query-aware information retrieval, that introduces a specialized tree-style data structure to efficiently encode, store, and retrieve multi-grained contextual information for text chunks.

Abstract

The limited context window of contemporary large language models (LLMs) remains a major barrier to their broader application across various domains. While continual pre-training on long-context data is a straightforward and effective solution, it incurs substantial costs in terms of data acquisition and computational resources. To alleviate this issue, we propose SharedLLM, a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval. SharedLLM is composed of two short-context LLMs such as LLaMA-2, termed the upper model and the lower model. The lower model functions as a compressor while the upper model acts as a decoder. The upper model receives compressed, multi-grained context information from the lower model and performs context-aware modeling on the running text. Information transfer between the compressor and decoder occurs only at the lowest layers, which avoids long forward paths in the lower model and redundant cross-attention modules in the upper model. Based on this architecture, we introduce a specialized tree-style data structure to efficiently encode, store, and retrieve multi-grained contextual information for text chunks. This structure, combined with a search algorithm, enables rapid encoding and retrieval of relevant information from various levels of the tree based on the input query. This entire process, wherein the sender and receiver are derived from the same LLM layer, is referred to as self-injection.
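To make the tree idea concrete, here is a minimal sketch of a query-aware split-and-search over one text chunk. It is an illustration under stated assumptions, not the paper's implementation: the binary split, the bag-of-words cosine `similarity`, the token-striding `compress`, and all names (`Node`, `split_and_search`) are hypothetical stand-ins. In SharedLLM the compressed objects are key-value states produced by the lower model, not raw tokens.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


@dataclass
class Node:
    tokens: list   # span of the chunk covered by this node
    start: int     # offset of the span within the chunk
    depth: int     # 1 = root (coarsest granularity)
    children: list = field(default_factory=list)


def similarity(a, b):
    """Toy relevance score: cosine similarity of bag-of-words counts (assumption)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def split_and_search(chunk, query, max_depth=3):
    """Split the most query-relevant node in two until max_depth is reached.

    Returns the tree root and the retained nodes in chunk order: the
    non-selected sibling from each level plus the final selected node.
    """
    root = Node(chunk, start=0, depth=1)
    node, kept = root, []
    while node.depth < max_depth and len(node.tokens) > 1:
        mid = len(node.tokens) // 2
        left = Node(node.tokens[:mid], node.start, node.depth + 1)
        right = Node(node.tokens[mid:], node.start + mid, node.depth + 1)
        node.children = [left, right]
        # Descend into the child that looks more relevant to the query.
        if similarity(left.tokens, query) >= similarity(right.tokens, query):
            node, other = left, right
        else:
            node, other = right, left
        kept.append(other)  # sibling stays at its own (coarser or equal) level
    kept.append(node)
    return root, sorted(kept, key=lambda n: n.start)


def compress(node, max_depth=3):
    """Toy compression (assumption): shallower nodes keep fewer tokens."""
    stride = 2 ** (max_depth - node.depth)
    return node.tokens[::stride]


chunk = "a b c d e f g h".split()
root, kept = split_and_search(chunk, query=["g", "h"], max_depth=3)
for n in kept:
    print(n.depth, n.start, compress(n))
```

With the query `["g", "h"]`, the search descends into the right half twice, so the span far from the query is kept only at the coarser level while the spans near it are kept in full. Any real system would replace the toy scoring and striding with learned components.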

Paper Structure

This paper contains 36 sections, 8 equations, 3 figures, 12 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of SharedLLM. The architecture resembles a general encoder-decoder architecture such as T5 (Raffel et al., 2020), but the interaction between the lower and upper models occurs at the first $M$ layers, through shared key-values that are encoded and compressed from the text chunk into a sequence of trees (top-left).
  • Figure 2: A running example of our tree (depth=3). The digits mark the step indices in the split-and-search procedure.
  • Figure 3: Comparison of memory usage (left) and total inference time on 100 examples (right) between SharedLLM and other recent baselines. The data were collected from a small experiment on 100 examples at each corresponding length. "OOM" indicates an out-of-memory exception triggered at test time.