Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining

Rupert Mitchell; Kristian Kersting

TL;DR

MuSe tackles the quadratic cost of softmax attention in long-context pretraining with a two-level scheme built on semantic clustering. It clusters queries and keys separately in their learned representation space, builds query-specific, exponentially tilted summaries of the key clusters, and augments those summaries with exact retrieval from the most relevant clusters. The result is a fast approximation of softmax attention that requires no architectural changes and remains compatible with existing pretrained models. Empirically, MuSe delivers a 36% wall-clock speedup at 64k context on models up to 1B parameters while preserving training quality and long-context utilization, and it applies to pretrained Llama models (3.1-8B and 3.2-1B) without retraining. Ablations show that query clustering matters and that the retrieval-based corrections are effective. The authors also note potential for further kernel optimizations and broader applicability.

Abstract

Pretraining transformers on long sequences (entire code repositories, collections of related documents) is bottlenecked by quadratic attention costs. We present Multipole Semantic Attention (MuSe), which accelerates 64k-context pretraining by 36% while matching baseline loss, requiring no architectural changes. MuSe clusters queries and keys separately in representation space. This yields query-specific summaries that substantially outperform spatial blocking at matched sparsity, while also enabling drop-in compatibility with existing pretrained models; we validate on Llama 3.1-8B and 3.2-1B without retraining. We pretrain language models up to 1B parameters at 64k context on code and scientific documents, confirming that MuSe preserves quality and long-context utilization during training.
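The general idea of summarizing key clusters while attending exactly to the most relevant ones can be illustrated with a small sketch. This is not the authors' implementation, and it makes several simplifying assumptions: it clusters only the keys (MuSe also clusters queries), it uses plain k-means, and it replaces each non-retrieved cluster with its centroid and mean value rather than the exponentially tilted summaries the paper describes. All names here (`kmeans`, `clustered_attention`, `top_r`) are hypothetical and chosen for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean(members)
    return assign

def exact_attention(q, keys, values):
    """Reference softmax attention for one query (raw dot-product scores)."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def clustered_attention(q, keys, values, assign, k, top_r):
    """Approximate attention: exact on the top_r clusters, summaries elsewhere.

    ASSUMPTION: a non-retrieved cluster of n keys is treated as n copies of
    its centroid, all carrying the cluster's mean value (a crude summary).
    """
    clusters = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]
    clusters = [idx for idx in clusters if idx]  # drop empty clusters
    centroids = [mean([keys[i] for i in idx]) for idx in clusters]
    # Rank clusters by how well their centroid matches the query.
    order = sorted(range(len(clusters)),
                   key=lambda c: dot(q, centroids[c]), reverse=True)
    retrieved = set(order[:top_r])
    terms = []  # (log-weight, value vector) pairs entering the softmax
    for c, idx in enumerate(clusters):
        if c in retrieved:
            terms += [(dot(q, keys[i]), values[i]) for i in idx]
        else:
            log_w = dot(q, centroids[c]) + math.log(len(idx))
            terms.append((log_w, mean([values[i] for i in idx])))
    m = max(lw for lw, _ in terms)
    w = [math.exp(lw - m) for lw, _ in terms]
    z = sum(w)
    return [sum(wi * v[j] for wi, (_, v) in zip(w, terms)) / z
            for j in range(len(values[0]))]

# Small synthetic demo: 32 random keys/values of dimension 4, 4 clusters.
rng = random.Random(1)
keys = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]
values = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]
q = [rng.gauss(0, 1) for _ in range(4)]
assign = kmeans(keys, 4)

full = exact_attention(q, keys, values)
all_retrieved = clustered_attention(q, keys, values, assign, 4, top_r=4)
approx = clustered_attention(q, keys, values, assign, 4, top_r=1)
```

When every cluster is retrieved (`top_r` equal to the number of clusters), the sketch reduces to exact softmax attention; lowering `top_r` trades accuracy for fewer exact key evaluations per query.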
