Transformer-VQ: Linear-Time Transformers via Vector Quantization

Lucas D. Lingle

TL;DR

Transformer-VQ is a decoder-only transformer that computes softmax-based dense self-attention in linear time by vector-quantizing keys and caching compressed statistics of earlier context. Because attention over quantized keys can be factorized exactly, the model keeps dense-attention semantics while training and sampling efficiently. Key contributions include a quadratic-time formulation of the attention, an equivalent linear-time recurrence for decoder attention, a VQ-based learning objective with EMA-updated codebooks, and ablations on codebook size and the compressive cache. In large-scale experiments the model reaches 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64, and its optimized implementation is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, over 12x faster at 32k, and scales to 131k with similar throughput.

Abstract

We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown to be highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code available: https://github.com/transformer-vq/transformer_vq

Paper Structure

This paper contains 47 sections, 7 theorems, 17 equations, 11 figures, and 11 tables.

Key Result

Theorem 2.2

Let $\mathbf{q} \in \mathbb{R}^{D}$ be a random variable with $\mathbb{E}_{\mathbf{q}}[\mathbf{q}\mathbf{q}^{\top}] = \sigma^{2} \mathbf{I}_{D}$ for some $\sigma > 0$, and let $\mathbf{k} \in \mathbb{R}^{D}$ be a random variable independent of $\mathbf{q}$. Let $\varphi : \mathbb{R}^{D} \to \dots$

Figures (11)

  • Figure 1: Schematic of the VQ-Attention approximation. The colorful and blank boxes depict the keys and attention weights, respectively. The keys on the right have been vector-quantized. Since the green keys ${k}_{2}, {k}_{5}$ map to the same code, they have the same attention weights in this attention head.
  • Figure 2: Schematic of the VQ-Attention factorization with element-wise $\phi_{w}$. Because of VQ, the matrix $\mathbf{W} = \phi_{w}(\mathbf{Q}\hat{\mathbf{K}}^{\top}) \in \mathbb{R}^{T \times T}$ has at most $S$ distinct columns, so the attention output $\mathbf{O} = \mathbf{W}\mathbf{V}$ can be obtained by computing only the unique attention scores $\phi_{w}(\mathbf{Q}{\mathbf{C}}^{\top})$ and multiplying them by the grouped sums $\mathbf{\Delta}\mathbf{V}$. Transformer-VQ uses a softmax-based extension of this idea for its cache.
  • Figure 3: Generated samples from our large ImageNet64 model; nucleus 1.0.
  • Figure 4: Sample excerpt from our PG-19 model, generated with nucleus 0.8.
  • Figure 5: Generated samples from our large ImageNet64 model; nucleus 0.999.
  • ...and 6 more figures
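
The identity described in the captions of Figures 1 and 2 can be checked numerically. The following is a minimal, dependency-free sketch, not the authors' implementation: it assumes nearest-neighbor assignment under squared Euclidean distance, takes exp as the element-wise $\phi_{w}$, and omits the causal mask and the softmax normalization. The helper names (matmul, transpose, quantize, phi, rand) and the toy sizes are hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def quantize(keys, codebook):
    """For each key, return the index of its nearest codebook row (squared L2)."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda s: dist(k, codebook[s])) for k in keys]


def phi(m):
    """Element-wise weight function; exp is an assumption made for this sketch."""
    return [[math.exp(x) for x in row] for row in m]


def rand(rows, cols):
    """Random Gaussian matrix used as toy data."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, S, D, Dv = 8, 3, 4, 5  # sequence length, codebook size, key dim, value dim
Q, K, V, C = rand(T, D), rand(T, D), rand(T, Dv), rand(S, D)

z = quantize(K, C)         # code index assigned to each key
K_hat = [C[s] for s in z]  # quantized keys: each key replaced by its code
# Delta is the S x T one-hot assignment matrix, so Delta @ V sums values per code.
Delta = [[1.0 if z[t] == s else 0.0 for t in range(T)] for s in range(S)]

# Quadratic route: T x T weights computed against the quantized keys.
W = phi(matmul(Q, transpose(K_hat)))
O_quadratic = matmul(W, V)

# Factorized route: T x S unique scores times S x Dv grouped value sums.
O_factorized = matmul(phi(matmul(Q, transpose(C))), matmul(Delta, V))

max_err = max(
    abs(a - b)
    for row_a, row_b in zip(O_quadratic, O_factorized)
    for a, b in zip(row_a, row_b)
)
print("largest absolute difference between the two routes:", max_err)
```

The quadratic route forms a $T \times T$ weight matrix, while the factorized route only forms $T \times S$ scores and $S \times D_v$ grouped sums, so its cost grows linearly in $T$ for a fixed codebook size. Keys that share a code produce identical columns in W, which is the situation Figure 1 depicts.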

Theorems & Definitions (24)

  • Definition 2.1
  • Theorem 2.2: Based on Guo et al. (2019)
  • Corollary 2.3
  • Corollary 2.4
  • Remark 2.5
  • Definition 2.6: Based on van den Oord et al. (2017)
  • Remark 2.7
  • Remark 2.8
  • Definition 3.1
  • Remark 3.2
  • ...and 14 more