LoLA: Low-Rank Linear Attention With Sparse Caching

Luke McDermott; Robert W. Heath; Rahul Parhi

TL;DR

LoLA addresses the memory bottleneck of transformer in-context learning by augmenting linear attention with a sparse cache while keeping the memory footprint constant. It distributes past key-value (KV) pairs across three stores: a local sliding window for recent pairs, a sparse global cache for difficult-to-memorize pairs, and the recurrent hidden state of linear attention for the rest. A self-recall error metric identifies memory collisions and decides which pairs the sparse cache should hold. On pass-key retrieval, LoLA raises the base model's accuracy from 0.6% to 97.4% with a cache 4.6x smaller than that of Llama-3.1 8B at 4K context length, and it outperforms other 1B and 8B subquadratic models on zero-shot commonsense reasoning. Because it is a training-free inference strategy, LoLA extends subquadratic LLMs toward lifelong in-context learning and lets the cache size be tuned to trade memory footprint against recall.

Abstract

The per-token cost of transformer inference scales with context length, preventing its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even on infinite context lengths. While this is a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial to efficiently manage long-term associative memories. On pass-key retrieval tasks, LoLA improves the base model's performance from 0.6% to 97.4% accuracy. This is achieved with a 4.6x smaller cache than Llama-3.1 8B on 4K context length. LoLA also outperforms other 1B and 8B parameter subquadratic models on zero-shot commonsense reasoning tasks.
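The three memory systems described above can be illustrated with a toy NumPy sketch. It is not the authors' implementation. The feature map `phi`, the class `ThreeTierMemory`, the exact form of the self-recall error, and the rule of evicting the lowest-error pair into the hidden state are all assumptions made for illustration; the abstract only states that such a metric manages the sparse cache.

```python
from collections import deque
import numpy as np


def phi(x):
    """Hypothetical positive feature map (ELU + 1); the paper's map may differ."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


class ThreeTierMemory:
    """Toy constant-memory store: sliding window, sparse cache, linear state."""

    def __init__(self, d_k, d_v, window=4, cache=4):
        self.window = deque()          # (i) most recent pairs
        self.cache = []                # (ii) difficult-to-memorize pairs
        self.H = np.zeros((d_k, d_v))  # (iii) recurrent state: sum of phi(k) v^T
        self.s = np.zeros(d_k)         # normalizer: sum of phi(k)
        self.window_size, self.cache_size = window, cache

    def self_recall_error(self, k, v):
        """Assumed form: squared error when reading v back from the state
        that would result from absorbing (k, v)."""
        f = phi(k)
        H_new = self.H + np.outer(f, v)
        s_new = self.s + f
        v_hat = f @ H_new / (f @ s_new)
        return float(np.sum((v_hat - v) ** 2))

    def write(self, k, v):
        self.window.append((k, v))
        if len(self.window) <= self.window_size:
            return
        # The oldest windowed pair becomes a candidate for the sparse cache.
        self.cache.append(self.window.popleft())
        if len(self.cache) <= self.cache_size:
            return
        # Absorb the pair the state would recall best; keep the hard ones cached.
        errors = [self.self_recall_error(ck, cv) for ck, cv in self.cache]
        ek, ev = self.cache.pop(int(np.argmin(errors)))
        self.H += np.outer(phi(ek), ev)
        self.s += phi(ek)

    def read(self, q):
        """Exact exp scores for stored pairs plus a linear-attention term."""
        scale = 1.0 / np.sqrt(q.shape[0])
        num = phi(q) @ self.H
        den = phi(q) @ self.s
        for k, v in list(self.window) + self.cache:
            w = np.exp(scale * (q @ k))
            num = num + w * v
            den = den + w
        return num / den
```

In this sketch the memory holds at most `window + cache` explicit pairs plus a fixed-size state, regardless of how many tokens are written. While every pair still fits in the window and cache, `read` reduces to ordinary softmax attention; once pairs are absorbed into `H`, their contribution is only approximated through `phi`.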
