ABC: Attention with Bounded-memory Control
Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, Noah A. Smith
TL;DR
The paper introduces Attention with Bounded-memory Control (ABC), a unified framework that bounds the memory a transformer attends over to a fixed size, giving time and space complexity $O(Nn)$, which is linear in the sequence length $N$ for a fixed memory size $n$. It casts existing efficient attention methods (e.g., Linformer, clustering-based, sliding-window) as ABC instances and introduces ABC_MLP, a learned, context-dependent memory controller that writes tokens into memory via a small MLP. Across language modeling, machine translation, and MLM finetuning, ABC_MLP achieves higher accuracy than prior ABC variants while using a much smaller memory, and it speeds up inference substantially with little or no accuracy loss relative to strong transformer baselines. The ABC view also shows that Linformer can be used in causal attention, and it offers guidance for designing future memory-efficient transformer architectures.
Abstract
Transformer architectures have achieved state-of-the-art results on a variety of sequence modeling tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead prohibitive, especially for long sequences. Attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size grows linearly with the sequence length, and so does the overhead of reading from it. One way to improve the efficiency is to bound the memory size. We show that disparate approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC), and they vary in their organization of the memory. ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart. Second, this abstraction gives new insights: an established approach (Wang et al., 2020b), previously thought not to be applicable in causal attention, actually is. Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their heuristic memory-organizing functions with a learned, contextualized one. Our experiments on language modeling, machine translation, and masked language model finetuning show that our approach outperforms previous efficient attention models; compared to the strong transformer baselines, it significantly improves the inference time and space efficiency with no or negligible accuracy loss.
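The bounded-memory view can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `tanh` MLP, the normalization of the control weights over positions, the $\sqrt{d}$ scaling, and the omission of query/key/value projections and causal masking are all assumptions made for illustration. It shows how a control matrix `Phi` of shape `(n, N)` compresses `N` token slots into `n` memory slots, so that each query reads from `n` entries instead of `N`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bounded_memory_attention(Q, K, V, Phi):
    """Attend over an n-slot memory instead of all N tokens.

    Q, K, V: arrays of shape (N, d). Phi: control matrix of shape (n, N).
    """
    K_mem = Phi @ K                                # (n, d) memory keys
    V_mem = Phi @ V                                # (n, d) memory values
    scores = Q @ K_mem.T / np.sqrt(Q.shape[-1])    # (N, n): cost O(N n d)
    return softmax(scores, axis=-1) @ V_mem        # (N, d)

def mlp_control(X, W1, W2):
    """Hypothetical learned controller mapping tokens (N, d) to Phi (n, N)."""
    logits = np.tanh(X @ W1) @ W2                  # (N, n) per-token slot scores
    return softmax(logits, axis=0).T               # each slot's weights sum to 1 over positions

# Toy sizes (assumed): N tokens, model width d, n memory slots, MLP width h.
rng = np.random.default_rng(0)
N, d, n, h = 128, 16, 8, 32
X = rng.normal(size=(N, d))
Q, K, V = X, X, X                                  # projections omitted for brevity
W1 = rng.normal(size=(d, h)) * 0.1
W2 = rng.normal(size=(h, n)) * 0.1

Phi = mlp_control(X, W1, W2)
out = bounded_memory_attention(Q, K, V, Phi)
```

Swapping `mlp_control` for a fixed, input-independent `Phi` would correspond to the heuristic memory-organizing functions that the abstract contrasts with the learned, contextualized one.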
