Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Daniel Y. Fu, Simran Arora, Jessica Grogan, Isys Johnson, Sabri Eyuboglu, Armin W. Thomas, Benjamin Spector, Michael Poli, Atri Rudra, Christopher Ré

TL;DR

Monarch Mixer (M2) introduces Monarch matrices as expressive, sub-quadratic primitives that mix information along both sequence and model axes. By framing Monarch multiplication as polynomial evaluation and interpolation, the authors show how to achieve causality without attention or MLPs, enabling GPT-style modeling at sub-quadratic cost. Across non-causal NLP, ViT-style vision, and causal language modeling, M2 demonstrates competitive quality with substantially fewer parameters and notable throughput gains on modern GPUs. While promising, the work also highlights the need for further system optimizations and broader evaluation to establish widespread applicability of Monarch-based architectures.

Abstract

Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. However, existing architectures such as Transformers scale quadratically along both these axes. We ask: are there performant architectures that can scale sub-quadratically along sequence length and model dimension? We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension: Monarch matrices, a simple class of expressive structured matrices that captures many linear transforms, achieves high hardware efficiency on GPUs, and scales sub-quadratically. As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1$\times$ higher throughput at sequence length 4K. On ImageNet, M2 outperforms ViT-b by 1% in accuracy, with only half the parameters. Causal GPT-style models introduce a technical challenge: enforcing causality via masking introduces a quadratic bottleneck. To alleviate this bottleneck, we develop a novel theoretical view of Monarch matrices based on multivariate polynomial evaluation and interpolation, which lets us parameterize M2 to be causal while remaining sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers at 360M parameters in pretraining perplexity on The PILE--showing for the first time that it may be possible to match Transformer quality without attention or MLPs.
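
To make the sub-quadratic claim concrete, here is a minimal sketch of an order-2 Monarch matrix-vector product. It assumes the factorization M = P L P R, where L and R are block-diagonal with sqrt(N) blocks of size sqrt(N) x sqrt(N) and P is the reshape-and-transpose permutation. The function names and this exact factorization are illustrative assumptions, not the authors' implementation.

```python
import math

def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (a list of b x b blocks) by vector x."""
    b = len(blocks[0])
    out = []
    for i, block in enumerate(blocks):
        chunk = x[i * b:(i + 1) * b]
        out.extend(sum(row[c] * chunk[c] for c in range(b)) for row in block)
    return out

def transpose_perm(x, b):
    """Permutation P: view x as a row-major b x b grid and transpose it."""
    return [x[c * b + r] for r in range(b) for c in range(b)]

def monarch_matvec(L_blocks, R_blocks, x):
    """Compute (P L P R) x; this order-2 form is an assumption of the sketch."""
    b = len(R_blocks)
    y = block_diag_matvec(R_blocks, x)   # b blocks, b*b multiplies each
    y = transpose_perm(y, b)             # mixes information across blocks
    y = block_diag_matvec(L_blocks, y)   # another b blocks of b*b multiplies
    return transpose_perm(y, b)

def multiply_counts(n):
    """Scalar multiplies: dense N^2 versus the two block-diagonal passes, 2 N^(3/2)."""
    b = math.isqrt(n)
    return n * n, 2 * b * b * b

identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
result = monarch_matvec([swap, swap], [identity, identity], [1, 2, 3, 4])
dense_cost, monarch_cost = multiply_counts(4096)
```

Under these assumptions, a length-4096 input needs 16,777,216 multiplies for a dense matrix but 524,288 for the two block-diagonal passes, a 32x reduction. Both passes are batched small matrix multiplies, which is the GEMM-friendly structure the title refers to.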

Paper Structure

This paper contains 81 sections, 45 theorems, 229 equations, 4 figures, 20 tables, 6 algorithms.

Key Result

Theorem 1

Let $m(j) = j \bmod \sqrt{N}$. For any vector $\mathbf{u} \in \mathbb{R}^N$, $\mathbf{M}\mathbf{u}$ is a bivariate polynomial $u(X, Y)$ evaluated at $A^2$, with $u(X, Y) = \sum_{j=0}^{N-1} u_j f_j(X, Y)$, where $f_j(X, Y) = \ell_{m(j)}(X, Y)\, r_j(Y)$.
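
The idea that a matrix-vector product is polynomial evaluation can be illustrated with a small sketch. The basis used here, $f_j(X, Y) = X^{m(j)} Y^{\lfloor j / \sqrt{N} \rfloor}$, is a hypothetical monomial stand-in for the theorem's $\ell_{m(j)}(X, Y)\, r_j(Y)$, which this page does not define, and the point set `points` is an arbitrary choice playing the role of $A$. The sketch only shows that evaluating $u(X, Y)$ on the grid $A^2$ is a linear map from the $N$ coefficients to $N$ values.

```python
def m(j, b):
    """Index map from the theorem: m(j) = j mod sqrt(N), with b = sqrt(N)."""
    return j % b

def basis(j, b, x, y):
    """Hypothetical monomial basis standing in for f_j(X, Y)."""
    return x ** m(j, b) * y ** (j // b)

def evaluate_on_grid(u, points):
    """Evaluate u(X, Y) = sum_j u_j f_j(X, Y) at every (x, y) in points x points."""
    b = len(points)
    return [sum(u[j] * basis(j, b, x, y) for j in range(len(u)))
            for x in points for y in points]

def evaluation_matrix(n, points):
    """The same evaluation written as an explicit n x n matrix."""
    b = len(points)
    return [[basis(j, b, x, y) for j in range(n)]
            for x in points for y in points]

u = [1, 2, 3, 4]              # coefficients of 1 + 2X + 3Y + 4XY
points = [1, 2]               # illustrative stand-in for the set A
values = evaluate_on_grid(u, points)
matrix = evaluation_matrix(len(u), points)
via_matrix = [sum(row[j] * u[j] for j in range(len(u))) for row in matrix]
```

Both routes give the same four values, so choosing the basis polynomials is equivalent to choosing the matrix. This is the kind of correspondence the theorem exploits, with the paper's own bases in place of the monomials used here.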

Figures (4)

  • Figure 1: Monarch matrices are a simple, expressive, and hardware-efficient class of sub-quadratic structured matrices. Monarch Mixer (M2) uses Monarch matrices to mix inputs first along the sequence dimension and then along the model dimension. See the Appendix for PyTorch implementation of an M2 layer.
  • Figure 2: Monarch multiplication can be interpreted as polynomial evaluation and interpolation. We derive sufficient conditions on the polynomial formulation of Monarch matrices for M2 to be causal.
  • Figure 3: M2-BERT uses Monarch matrices to create a bidirectional gated long convolution in the sequence mixer, and uses Monarch matrices to replace the linear layers in the dimension mixer.
  • Figure 4: Roofline plot of a PyTorch implementation of a single M2 operator ${\mathbf{M}}^{-1} ({\mathbf{M}} \mathbf{u} \odot {\mathbf{M}} \mathbf{k})$.
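
The operator in Figure 4 can be sketched for one special case. Assuming $\mathbf{M}$ is the discrete Fourier transform (a transform Monarch matrices are able to represent), $\mathbf{M}^{-1}(\mathbf{M}\mathbf{u} \odot \mathbf{M}\mathbf{k})$ reduces to a circular convolution of $\mathbf{u}$ with $\mathbf{k}$ by the convolution theorem. The naive quadratic DFT below is only for checking the identity on tiny inputs. It is not the paper's kernel, and the function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT; the stand-in for M (or M^{-1}) in this sketch."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * f / n) for t in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out

def m2_operator(u, k):
    """Compute M^{-1}(M u ⊙ M k), with M assumed to be the DFT."""
    product = [a * b for a, b in zip(dft(u), dft(k))]   # elementwise product
    return [v.real for v in dft(product, inverse=True)]

def circular_convolution(u, k):
    """Direct definition: out[i] = sum_j u[j] * k[(i - j) mod n]."""
    n = len(u)
    return [sum(u[j] * k[(i - j) % n] for j in range(n)) for i in range(n)]

u = [1.0, 2.0, 3.0, 4.0]
k = [0.0, 1.0, 0.0, 0.0]      # a one-step shift kernel
fast = m2_operator(u, k)
direct = circular_convolution(u, k)
```

With the shift kernel, both routes rotate the input by one position. Note that a circular convolution lets late positions influence early ones, so this special case is not causal on its own.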

Theorems & Definitions (91)

  • Theorem 1
  • Theorem 2
  • Definition 1
  • Theorem 3
  • Definition 2: Degree
  • Definition 3
  • Theorem 4
  • proof
  • Theorem 5
  • proof
  • ...and 81 more