Efficient Low Rank Attention for Long-Context Inference in Large Language Models

Tenghui Li, Guoxu Zhou, Xuyang Zhao, Yuning Qiu, Qibin Zhao

TL;DR

This work tackles the KV-cache memory bottleneck in long-context LLMs by introducing LRQK, which jointly factorizes the full-precision query and key matrices into low-rank representations and computes proxy attention scores to identify a small set of relevant KV entries. A mixed GPU-CPU cache with a hit/miss mechanism and a recency buffer minimizes data transfers while preserving exact attention on retrieved tokens. The method demonstrates competitive accuracy on RULER and LongBench across multiple models, while delivering substantial memory savings and enabling longer contexts on constrained hardware. Ablation studies show the effectiveness of the low-rank approximation, top-k active token selection, and lite recency tokens, though CPU indexing remains a bottleneck and hyperparameters require task-specific tuning.

Abstract

As the length of input text grows, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. We introduce Low Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes the full-precision query and key matrices into compact rank-\(r\) factors during the prefill stage, and then uses these low-dimensional projections to compute proxy attention scores in \(\mathcal{O}(lr)\) time at each decode step. LRQK then selects only the top-\(k\) tokens plus a small fixed set of recent tokens, and uses a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only the missing full-precision KV pairs, thereby preserving exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long-context settings, while delivering significant memory savings with minimal loss in accuracy. Our code is available at https://github.com/tenghuilee/LRQK.
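
The decode-step selection described above can be illustrated with a small sketch. It is not the authors' implementation: the function names (`proxy_scores`, `select_tokens`), the parameter `n_recent`, and the choice to rank only tokens outside the recent window are assumptions made for illustration. The sketch scores each of the \(l\) cached tokens with a rank-\(r\) dot product, which costs \(\mathcal{O}(lr)\), then keeps the top-\(k\) scoring tokens together with the most recent ones. The softmax is omitted because it does not change the ranking.

```python
import heapq


def proxy_scores(q_hat, key_factors):
    """Dot product of the rank-r query with each rank-r key row: O(l * r)."""
    return [sum(q * a for q, a in zip(q_hat, row)) for row in key_factors]


def select_tokens(q_hat, key_factors, top_k, n_recent):
    """Return sorted indices of tokens whose full-precision KV pairs are needed.

    Assumption: the most recent `n_recent` tokens are always kept, and the
    top-k search runs only over the older tokens.
    """
    l = len(key_factors)
    recent = set(range(max(0, l - n_recent), l))
    scores = proxy_scores(q_hat, key_factors)
    candidates = [i for i in range(l) if i not in recent]
    top = heapq.nlargest(top_k, candidates, key=lambda i: scores[i])
    return sorted(recent.union(top))


# Toy example: l = 6 cached tokens, rank r = 2 (values are made up).
key_factors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 0.0],
    [0.5, 0.5],
    [0.0, 0.0],
    [1.0, 1.0],
]
q_hat = [1.0, 0.0]
selected = select_tokens(q_hat, key_factors, top_k=2, n_recent=2)
print(selected)  # [0, 2, 4, 5]
```

In this toy run, tokens 4 and 5 are kept as the recent window, and tokens 2 and 0 have the highest proxy scores among the older tokens. Exact attention would then be computed only over the full-precision keys and values at the selected indices.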

Paper Structure

This paper contains 43 sections, 34 equations, 5 figures, 12 tables, 2 algorithms.

Figures (5)

  • Figure 1: Brief overview of the proposed Low Rank Query and Key attention (LRQK) method. Subscript $\Omega$ denotes the selected tokens, $t$ denotes the current token. $\mathbf{q}_t, \mathbf{k}_t$ are the original query and key, $\widehat{\mathbf{q}}_t, \widehat{\mathbf{k}}_t$ are the approximated query and key. $\mathbf{A}_{K,t}$ is the low rank key matrix. $\mathbf{K}_{\Omega, t-1}', \mathbf{V}_{\Omega, t-1}'$ are GPU cache $\mathbf{K}_{\Omega, t-1}, \mathbf{V}_{\Omega, t-1}$ merged with fetched CPU keys and values.
  • Figure 2: Time comparison of the full-GPU, full-CPU, and proposed LRQK methods. The orange block is the selection operation for KV pairs. The black blocks are cache-loading operations. Blocks above the line are GPU operations and blocks below it are CPU operations.
  • Figure 3: Examples of the mean of singular values of the query and key matrices over different layers on Qwen2.5-7B and LLaMA-3-8B-1M models. The singular values are summed over batches and attention heads.
  • Figure 4: Examples of the attention scores of the neighbors of the current token. The window size is 16. The $x$-label "current" is the index of the current token, $-8$ means the token at $t - 8$, and so on. The attention scores are averaged over batches and attention heads.
  • Figure 5: Histogram of miss rates on wikitext-2-v1.
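
The hit-and-miss cache from the abstract, and the miss rate that Figure 5 reports, can be made concrete with a minimal sketch. Everything here is hypothetical: plain dictionaries stand in for GPU and CPU memory, `fetch_missing` is an illustrative name, the eviction of unselected entries is an assumed policy, and the miss rate is assumed to be the fraction of selected tokens that had to be fetched from the CPU.

```python
def fetch_missing(gpu_cache, cpu_store, selected):
    """Make the selected tokens' KV pairs available in the simulated GPU cache.

    gpu_cache and cpu_store map token index -> (key, value).
    Only tokens absent from gpu_cache (misses) are copied from cpu_store.
    Returns (hits, misses) as lists of token indices.
    """
    hits = [i for i in selected if i in gpu_cache]
    misses = [i for i in selected if i not in gpu_cache]
    for i in misses:
        gpu_cache[i] = cpu_store[i]  # the only "CPU to GPU transfer"
    # Assumed policy: drop entries that were not selected at this step.
    keep = set(selected)
    for i in list(gpu_cache):
        if i not in keep:
            del gpu_cache[i]
    return hits, misses


# Toy example with string placeholders for full-precision keys and values.
cpu_store = {i: (f"k{i}", f"v{i}") for i in range(6)}
gpu_cache = {i: cpu_store[i] for i in (0, 1, 4)}
hits, misses = fetch_missing(gpu_cache, cpu_store, selected=[0, 2, 4, 5])
miss_rate = len(misses) / (len(hits) + len(misses))
print(hits, misses, miss_rate)  # [0, 4] [2, 5] 0.5
```

Under these assumptions, a lower miss rate means fewer KV pairs cross the CPU-GPU boundary per decode step, which is the quantity the mixed cache is designed to reduce.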