Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin

TL;DR

Problem: LLMs are constrained by a fixed context limit and by the quadratic cost of self-attention. Approach: HOMER is a training-free scheme that divides a long input into chunks, hierarchically merges adjacent chunks after token reduction, and refines the lower-layer embeddings; a DFS-based computation order yields $O(\log n)$ memory growth. Contributions: a hierarchical context merging framework, propagative refinement, and compatibility with RoPE-scaling methods, demonstrated without fine-tuning on passkey retrieval, question answering, and language modeling. Findings: experiments report memory savings of over $70\%$ and speedups of up to $162.6\%$ on contexts of up to 64K tokens, while maintaining fluency and improving accuracy. Significance: makes long-context inference with pre-trained LLMs practical in memory-constrained settings.

Abstract

Large language models (LLMs) have shown remarkable performance in various natural language processing tasks. However, a primary constraint they face is the context limit, i.e., the maximum number of tokens they can process. Previous works have explored architectural changes and modifications in positional encoding to relax the constraint, but they often require expensive training or do not address the computational demands of self-attention. In this paper, we present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome these limitations. HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. The chunks are then processed collectively, using a hierarchical strategy that merges adjacent chunks at progressive transformer layers. A token reduction technique precedes each merging step, keeping memory usage efficient. We also propose an optimized computational order that makes the memory requirement scale logarithmically with input length, which is especially favorable for environments with tight memory restrictions. Our experiments demonstrate the proposed method's superior performance and memory efficiency, enabling the broader use of LLMs in applications that require long contexts. Code is available at https://github.com/alinlab/HOMER.
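
The divide, reduce, and merge loop described in the abstract can be illustrated with a toy sketch on plain token ids. This is not the authors' implementation: it omits the transformer layers entirely, and the names `split_into_chunks`, `reduce_tokens`, `hierarchical_merge`, and the `score` function are hypothetical stand-ins (the scoring rule used here is an arbitrary placeholder, not the paper's token reduction criterion). It only shows how reducing each chunk before concatenation keeps the merged chunk length bounded.

```python
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[int], num_chunks: int) -> List[List[int]]:
    """Split tokens into num_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(tokens), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(tokens[start:end]))
        start = end
    return chunks


def reduce_tokens(chunk: List[int], keep: int,
                  score: Callable[[int], float]) -> List[int]:
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    if len(chunk) <= keep:
        return list(chunk)
    ranked = sorted(range(len(chunk)), key=lambda i: score(chunk[i]), reverse=True)
    kept_positions = sorted(ranked[:keep])
    return [chunk[i] for i in kept_positions]


def hierarchical_merge(tokens: Sequence[int], num_chunks: int,
                       score: Callable[[int], float]) -> List[int]:
    """Repeatedly reduce each chunk, then concatenate adjacent pairs."""
    # Assumption for this sketch: a power-of-two chunk count (full binary tree).
    assert num_chunks > 0 and num_chunks & (num_chunks - 1) == 0
    chunks = split_into_chunks(tokens, num_chunks)
    max_len = max(len(c) for c in chunks)
    assert max_len >= 2, "chunks must hold at least two tokens"
    while len(chunks) > 1:
        # Token reduction precedes merging, so a merged chunk never exceeds max_len.
        reduced = [reduce_tokens(c, max_len // 2, score) for c in chunks]
        chunks = [reduced[i] + reduced[i + 1] for i in range(0, len(reduced), 2)]
    return chunks[0]


if __name__ == "__main__":
    # Placeholder score: larger token ids count as "more important".
    merged = hierarchical_merge(list(range(32)), num_chunks=4, score=float)
    print(len(merged), merged)
```

With 32 tokens and 4 chunks, every intermediate chunk stays at 8 tokens or fewer, regardless of how many levels of merging are applied.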

Paper Structure (22 sections, 7 equations, 6 figures, 9 tables, 1 algorithm)

This paper contains 22 sections, 7 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) Passkey retrieval accuracy on various context lengths, measured with Llama-2-7b-chat. HOMER maintains reasonable performance for context lengths up to 32K tokens. Detailed comparisons with more baselines are provided in \ref{tab:passkey}. (b) The memory requirement for processing long inputs. (c) Average inference time required for generating 100 tokens conditioned on various context lengths. All efficiency measurements are done with a single A100 GPU. The baselines include plain Llama, PI, NTK, and YaRN. Peak memory usage of the baselines at 64K is an estimated value, as they do not fit in a single A100 GPU. Detailed results are provided in \ref{tab:memory} and \ref{sec:app-speed}.
  • Figure 2: An overview of the proposed hierarchical context merging. We first divide a long context into multiple chunks and independently forward them through the early transformer layers. In the intermediate layers, we merge multiple chunks by concatenation, forming a new, merged chunk. To keep the chunk length bounded, we apply token reduction on the original chunks to make them shorter, prior to merging. This process is repeated until all chunks are merged into a single chunk. Finally, we further refine the lower-layer embeddings to get a compact, fixed-length, layer-wise embedding. The embedding can then be used like a standard KV cache \cite{chen2022kvcache}.
  • Figure 3: Hierarchical context merging process conceptualized as a binary tree. The top-left numbers of each node denote the memory-efficient computation order. Note that propagative refinement must be applied after processing each node to enjoy the optimized memory usage.
  • Figure 4: Illustration of the propagative refinement process.
  • Figure 5: Perplexity plot on 25 long documents from the PG-19 dataset \cite{rae2019compressive}, measured with Llama-2-7b. HOMER consistently achieves low perplexity across long documents up to 64K tokens, demonstrating its ability to remain fluent while conditioned on very long inputs. Detailed comparisons with more baselines are provided in \ref{tab:perplexity}.
  • ...and 1 more figures
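
The memory-efficient computation order mentioned in the Figure 3 caption can be illustrated by counting how many chunks must be held at once when the binary merge tree is processed depth-first. The sketch below is a simplified model, not the paper's algorithm: it assumes a power-of-two number of leaf chunks, treats every chunk as one unit of memory (which presumes token reduction and refinement keep each node's footprint bounded), and the function names `peak_live_chunks_dfs` and `peak_live_chunks_level_order` are hypothetical.

```python
def peak_live_chunks_dfs(num_leaves: int) -> int:
    """Peak number of chunks alive when the merge tree is processed in post-order."""
    assert num_leaves > 0 and num_leaves & (num_leaves - 1) == 0  # power of two
    live = 0
    peak = 0

    def visit(n: int) -> None:
        nonlocal live, peak
        if n == 1:
            live += 1              # process one leaf chunk and keep its result
            peak = max(peak, live)
        else:
            visit(n // 2)          # left subtree finishes with one chunk alive
            visit(n // 2)          # right subtree finishes with one more
            live -= 1              # merging two children leaves a single chunk

    visit(num_leaves)
    return peak


def peak_live_chunks_level_order(num_leaves: int) -> int:
    """Peak number of chunks alive when all leaves are processed before any merge."""
    return num_leaves


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(n, peak_live_chunks_dfs(n), peak_live_chunks_level_order(n))
```

Under these assumptions the depth-first order holds at most $\log_2 n + 1$ chunks for $n$ leaf chunks, whereas processing a whole level at a time holds all $n$, which matches the logarithmic memory scaling stated in the abstract.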