FLuRKA: Fast and accurate unified Low-Rank & Kernel Attention
Ahan Gupta, Hao Guo, Yueming Yuan, Yanqi Zhou, Charith Mendis
TL;DR
FLuRKA introduces a unified attention mechanism that fuses low-rank and kernel approximations to produce transformers whose step time is lower than that of either constituent method. The authors provide a theoretical speed analysis and a bound on the approximation error relative to full attention, and they validate three variants that achieve speedups of up to 3.3x over low-rank and 1.7x over kernel methods while maintaining competitive accuracy across language modeling, language understanding, long-sequence modeling, machine translation, and image classification. Empirically, the FLuRKA variants match or outperform their base components on six benchmarks, with training efficiency enabled by reduced FLOPs and an up-training strategy that blends base models with FLuRKA. The result is faster, more scalable transformer models for text and vision tasks at lower computational cost.
Abstract
Many efficient $\textit{approximate}$ self-attention techniques have become prevalent since the inception of the transformer architecture. Two popular classes of these techniques are low-rank and kernel methods. Each of these methods has its strengths. We observe these strengths synergistically complement each other and exploit them to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA ($\textbf{F}$ast $\textbf{L}$ow-$\textbf{R}$ank & $\textbf{K}$ernel $\textbf{A}$ttention). FLuRKA are highly $\textit{training-efficient}$ with faster model speeds $\textit{and}$ similar model qualities compared to constituent low-rank and kernel methods. We theoretically and empirically evaluate the speed and quality of FLuRKA. Our model speed analysis posits a variety of parameter configurations where FLuRKA exhibit speedups over low-rank and kernel approximations and our model quality analysis bounds the error of FLuRKA with respect to full-attention. Empirically, we instantiate three FLuRKA variants which experience speedups of up to 3.3x and 1.7x over low-rank and kernel methods respectively. This translates to speedups of up to 20x over models with flash-attention. Across a diverse set of tasks spanning language modeling, language understanding, long sequence modeling, machine translation, and image classification, FLuRKA achieve comparable accuracy with underlying low-rank and kernel approximations, occasionally surpassing both.
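The abstract does not spell out how the two approximations are combined, so the following is only an illustrative sketch of one way a low-rank step and a kernel step could be fused. It assumes a Linformer-style down-projection of the keys and values along the sequence axis, followed by a linear-attention-style computation with an `elu(x) + 1` feature map. The names `feature_map`, `fused_attention`, `E_k`, and `E_v`, as well as the choice of feature map, are hypothetical and are not taken from the paper.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a strictly positive feature map (assumed kernel choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def fused_attention(Q, K, V, E_k, E_v):
    """Sketch of a fused low-rank + kernel attention (hypothetical form).

    Q, K, V : (n, d) query, key, and value matrices for one head.
    E_k, E_v: (k, n) down-projections along the sequence axis, k << n.
    """
    # Low-rank step: shrink the sequence axis of K and V from n to k.
    K_low = E_k @ K                      # (k, d)
    V_low = E_v @ V                      # (k, d)

    # Kernel step: replace softmax with a feature-map inner product.
    phi_q = feature_map(Q)               # (n, d)
    phi_k = feature_map(K_low)           # (k, d)

    # Reassociate the product so no n-by-n attention matrix is formed.
    kv = phi_k.T @ V_low                 # (d, d)
    norm = phi_q @ phi_k.sum(axis=0)     # (n,), positive by construction
    return (phi_q @ kv) / norm[:, None]  # (n, d)


# Toy usage with random inputs (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E_k = rng.standard_normal((k, n)) / np.sqrt(n)
E_v = rng.standard_normal((k, n)) / np.sqrt(n)
out = fused_attention(Q, K, V, E_k, E_v)
```

In this sketch the down-projection shortens the sequence axis before the kernel step. The reassociated product then works only with `(k, d)` and `(d, d)` intermediates, which shows how the strengths of the two families could complement each other.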
