Macformer: Transformer with Random Maclaurin Feature Attention

Yuhan Guo, Lizhong Ding, Ye Yuan, Guoren Wang

TL;DR

Macformer tackles the quadratic self-attention bottleneck by using Random Maclaurin Features to linearly approximate dot-product kernels, enabling linear-time attention. It introduces RMFA and ppSBN, providing unbiased estimates and explicit error control for kernelized attention. Experiments on synthetic data and the Long Range Arena benchmark show that Macformer delivers substantial speedups while maintaining competitive accuracy across several kernels, with ppSBN stabilizing training. Overall, the work offers a flexible, theoretically grounded framework for kernelized attention in long-sequence modeling.

Abstract

Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by RFA, we propose Macformer, a Transformer architecture that employs random Maclaurin features (RMF) to approximate various dot-product kernels, thereby accelerating attention computations for long sequences. Macformer consists of Random Maclaurin Feature Attention (RMFA) and pre-post Scaling Batch Normalization (ppSBN): the former is an unbiased approximation of dot-product kernelized attention, and the latter is a two-stage regularization mechanism that guarantees a bound on the error of RMFA. We conducted toy experiments to demonstrate the efficiency of RMFA and ppSBN, and experiments on the Long Range Arena (LRA) benchmark to validate the acceleration and accuracy of Macformer with different dot-product kernels. The experimental results of Macformer are consistent with our theoretical analysis.
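
The idea of approximating a dot-product kernel with random Maclaurin features can be illustrated with a small sketch. The code below is not the authors' implementation; it assumes the standard random Maclaurin construction for the exponential kernel $\exp(\langle x, y\rangle)$, whose Maclaurin coefficients are $a_n = 1/n!$. Each feature draws a degree $N$ with probability $2^{-(N+1)}$, multiplies $N$ Rademacher projections of the input, and rescales by $\sqrt{a_N 2^{N+1}}$. The helper names `sample_rmf` and `apply_rmf` and the test vectors are hypothetical.

```python
import math
import random


def sample_rmf(dim, num_features, rng):
    """Draw parameters for random Maclaurin features of exp(<x, y>) (assumed kernel)."""
    features = []
    for _ in range(num_features):
        # Degree N with P[N = n] = 2 ** -(n + 1): count coin flips until the first failure.
        degree = 0
        while rng.random() < 0.5:
            degree += 1
        # sqrt(a_N * 2 ** (N + 1)) with a_N = 1 / N! for the exponential kernel.
        scale = math.sqrt(2.0 ** (degree + 1) / math.factorial(degree))
        # One Rademacher (+1/-1) vector per factor of the degree-N monomial.
        signs = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(degree)]
        features.append((scale, signs))
    return features


def apply_rmf(features, x):
    """Map x to its feature vector; inner products of these estimate the kernel."""
    norm = math.sqrt(len(features))
    out = []
    for scale, signs in features:
        value = scale
        for w in signs:
            value *= sum(wi * xi for wi, xi in zip(w, x))
        out.append(value / norm)
    return out


rng = random.Random(0)
x = [0.3, 0.1, -0.2, 0.4]   # illustrative inputs inside the unit ball
y = [0.2, -0.1, 0.5, 0.1]
features = sample_rmf(dim=len(x), num_features=20000, rng=rng)

approx = sum(a * b for a, b in zip(apply_rmf(features, x), apply_rmf(features, y)))
exact = math.exp(sum(a * b for a, b in zip(x, y)))
print(f"feature estimate: {approx:.4f}, exact kernel: {exact:.4f}")
```

The estimate is unbiased for any number of features, and its variance shrinks as the number of features grows. A different dot-product kernel with nonnegative Maclaurin coefficients would only change the `scale` term.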

Paper Structure

This paper contains 13 sections, 3 theorems, 16 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Suppose the attention inputs satisfy $\bm{Q},\bm{K}\in \ell_2 (0,1)$ and $\Phi (\cdot)$ defines a Random Maclaurin Feature map for a dot-product kernel $\mathcal{K}(\cdot)$. Then for every $\bm{V}\in \mathbb{R}^{n \times d}$, we have $\mathbb{E}[{\rm RMFA}_\mathcal{K}(\bm{Q},\bm{K},\bm{V})]={\rm attn}_\mathcal{K}(\bm{Q},\bm{K},\bm{V})$.

Figures (4)

  • Figure 1: The Macformer architecture improves the multi-head attention component of the original Transformer, with RMFA being wrapped by the preSBN and postSBN layers.
  • Figure 2: Computation graphs for Softmax attention and RMFA. In each figure, the data on the left represents the input to the attention layer. Here, operators $(\cdot)$, $\odot$, and $\otimes$ respectively denote matrix multiplication, mask fill, and outer product. The main time complexity caused by computations is marked on the left side of the operators, and the dimensions of data and intermediate results are indicated with superscripts.
  • Figure 3: The loss, perplexity, and BLEU scores of the traditional Transformer with and without ppSBN across training epochs. In each plot, solid lines represent the Transformer with ppSBN, while dashed lines represent the Transformer without ppSBN.
  • Figure 4: The error (a) and acceleration (b) of ${\rm RMFA}_{\rm exp}$ compared to Softmax attention for different sequence lengths and values of $D$. The data in the figure have been smoothed; darker colors represent smaller values and lighter colors represent larger values.
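
The complexity gap that Figure 2 contrasts comes from the order in which the matrix products are evaluated. The sketch below is a simplified illustration under stated assumptions, not the paper's RMFA: it uses random stand-in matrices for the feature-mapped queries and keys, omits masking and any normalization, and the names `phi_q`, `phi_k`, and `matmul` are hypothetical. It only checks that both evaluation orders give the same result.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


rng = random.Random(0)
n, D, d = 6, 4, 3  # sequence length, feature dimension, value dimension

# Stand-ins for the feature-mapped queries and keys (assumed, not real RMF outputs).
phi_q = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n)]
phi_k = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n)]
v = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Quadratic order: build the n x n score matrix first, costing O(n^2 D + n^2 d).
quadratic = matmul(matmul(phi_q, transpose(phi_k)), v)

# Linear order: contract keys with values into a D x d summary first, costing O(n D d).
linear = matmul(phi_q, matmul(transpose(phi_k), v))

max_gap = max(abs(a - b) for ra, rb in zip(quadratic, linear) for a, b in zip(ra, rb))
print(f"largest difference between the two orders: {max_gap:.2e}")
```

Because the $D \times d$ summary does not depend on the sequence length, the second order scales linearly in $n$ for fixed $D$ and $d$, which is where the reported acceleration for long sequences originates.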

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof