Transformer-VQ: Linear-Time Transformers via Vector Quantization
Lucas D. Lingle
TL;DR
Transformer-VQ is a decoder-only transformer that computes dense softmax self-attention in linear time by vector-quantizing the keys and caching compressed statistics. Attention over the quantized keys remains equivalent to dense attention, so the model supports both efficient training and efficient sampling. Large-scale experiments report 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. Contributions include the quadratic-time formulation, a linear-time recurrence for decoder attention, a VQ learning objective with EMA-updated codebooks, and ablations on codebook size and the compressive cache. The optimized implementation is over 3x faster than a comparable quadratic-time transformer at sequence length 8k and over 12x faster at 32k, and it scales to 131k with similar throughput.
Abstract
We introduce Transformer-VQ, a decoder-only transformer that computes softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code is available at https://github.com/transformer-vq/transformer_vq
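To illustrate why quantized keys allow linear-time attention, the following is a minimal pure-Python sketch, not the paper's optimized block-wise implementation. It relies on one observation: if every key is replaced by its nearest codeword, each query only needs one score per codeword, plus a running sum of values and a count for each codeword. All names (`nearest_code`, `cached_causal_attention`, `demo`) are hypothetical, and the sketch assumes a single head, no score scaling, no positional biases, and no numerical stabilization of the exponentials.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_code(key, codebook):
    """Index of the codeword closest to `key` in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda s: sum((k - c) ** 2 for k, c in zip(key, codebook[s])))


def dense_causal_attention(queries, codes, codebook, values):
    """Quadratic-time reference: causal softmax attention over quantized keys."""
    dim_v = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # One score per past position, so the total work grows as T^2.
        weights = [math.exp(dot(q, codebook[codes[j]])) for j in range(i + 1)]
        total = sum(weights)
        outputs.append([sum(w * values[j][d] for j, w in enumerate(weights)) / total
                        for d in range(dim_v)])
    return outputs


def cached_causal_attention(queries, codes, codebook, values):
    """Linear-time variant: one running value sum and one count per codeword."""
    num_codes, dim_v = len(codebook), len(values[0])
    value_sums = [[0.0] * dim_v for _ in range(num_codes)]
    counts = [0] * num_codes
    outputs = []
    for q, z, v in zip(queries, codes, values):
        # Fold the current position into the cache first (attention includes itself).
        counts[z] += 1
        for d in range(dim_v):
            value_sums[z][d] += v[d]
        # One score per codeword, independent of how many positions came before.
        scores = [math.exp(dot(q, c)) for c in codebook]
        total = sum(s * n for s, n in zip(scores, counts))
        outputs.append([sum(scores[s] * value_sums[s][d] for s in range(num_codes)) / total
                        for d in range(dim_v)])
    return outputs


def demo(seq_len=12, dim=4, num_codes=3, seed=0):
    """Return the largest absolute gap between the dense and cached outputs."""
    rng = random.Random(seed)
    vec = lambda n: [rng.gauss(0.0, 1.0) for _ in range(n)]
    codebook = [vec(dim) for _ in range(num_codes)]
    queries = [vec(dim) for _ in range(seq_len)]
    keys = [vec(dim) for _ in range(seq_len)]
    values = [vec(dim) for _ in range(seq_len)]
    codes = [nearest_code(k, codebook) for k in keys]
    dense = dense_causal_attention(queries, codes, codebook, values)
    cached = cached_causal_attention(queries, codes, codebook, values)
    return max(abs(a - b) for ra, rb in zip(dense, cached) for a, b in zip(ra, rb))
```

Calling `demo()` compares the two routines on random inputs; the returned gap should sit at floating-point rounding level, since both compute the same quantity. The cached version does work proportional to the codebook size per position rather than to the number of preceding positions.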
