Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Andres Potapczynski; Shikai Qiu; Marc Finzi; Christopher Ferri; Zixi Chen; Micah Goldblum; Bayan Bruss; Christopher De Sa; Andrew Gordon Wilson

TL;DR

This work proposes BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure, and finds BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.

Abstract

Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. Namely, a small $\omega$ (which measures parameter sharing) and large $\psi$ (which measures the rank) reliably led to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to the standard sparse MoE for each entire feed-forward network, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.
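
As a rough illustration of what "linear operators expressible via an Einstein summation" means, the sketch below writes two of the structures named in the abstract, low-rank and Kronecker, as two-factor einsums in NumPy and checks them against the equivalent dense matrices. The helper names (`low_rank_apply`, `kronecker_apply`), the shapes, and the index letters are assumptions chosen for illustration; they are not the paper's parameterization or code.

```python
import numpy as np


def low_rank_apply(U, V, x):
    """Apply W = U @ V to x without forming W (illustrative helper)."""
    # U: (d_out, r), V: (r, d_in), x: (d_in,)
    return np.einsum("or,ri,i->o", U, V, x)


def kronecker_apply(A, B, x):
    """Apply W = kron(A, B) to x without forming W (illustrative helper)."""
    a_out, a_in = A.shape
    b_out, b_in = B.shape
    # View the input vector as an (a_in, b_in) grid; each factor acts on one axis.
    X = x.reshape(a_in, b_in)
    Y = np.einsum("oi,pj,ij->op", A, B, X)
    return Y.reshape(a_out * b_out)


rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal(d)

U, V = rng.standard_normal((d, 2)), rng.standard_normal((2, d))  # rank 2
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))  # 4 * 4 = 16

# Both einsums agree with multiplying by the explicit dense matrix.
assert np.allclose(low_rank_apply(U, V, x), (U @ V) @ x)
assert np.allclose(kronecker_apply(A, B, x), np.kron(A, B) @ x)

# Each structure stores far fewer parameters than the dense d x d matrix.
print("dense params:", d * d)
print("low-rank params:", U.size + V.size)
print("Kronecker params:", A.size + B.size)
```

Both structures are cheaper than a dense layer, but the low-rank operator has rank at most 2 here, while the Kronecker product of two full-rank factors is itself full-rank. That is the kind of difference the abstract's rank variable $\psi$ is meant to capture.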

Paper Structure (37 sections, 23 equations, 10 figures)

Introduction
Parameterizing the Space of Einsums
A Taxonomy of the Space of Einsum Linear Structures
Scaling Laws of Einsums
Main Experimental Setup
Analyzing the Compute-Optimal Scaling Laws
Our Findings Generalize to Other Settings
Structured Mixture of Experts
More Parameters than FLOPs via Mixture of Experts
Compute Efficiency Gains
Effect of Structures
Scaling Optimization for Einsums
Conclusion
Examples of Einsums
Dense
...and 22 more sections

Figures (10)

Figure 1: We use Einsums to parameterize a wide range of structured matrices and search for the most efficient structure for compute-optimal training. Left: A diagrammatic representation of a general two-factor Einsum. We parameterize the space of Einsums through a real-valued vector $\bm{\theta}=(\theta_\alpha,\theta_\beta,\theta_\gamma,\theta_\delta,\theta_\epsilon,\theta_\phi,\theta_\rho) \in [0,1]^{7}$. This space captures many well-known structures through specific values of $\bm{\theta}$. Middle: Example of well-known structures with their $\bm{\theta}$ values. Any omitted line implies the value of the entry in the vector is 0. Right: Compute-optimal scaling laws of example structures for GPT-2 on OpenWebText when substituting its dense layers (see details in the experiments section).
Figure 2: Illustrating the Einsum taxonomy. The 3D graph represents relevant quantities of the Einsum structure such as the amount of parameter sharing $\omega$ (x-axis), its rank $\psi$ (y-axis), and its compute intensity $\nu$ (z-axis). The structures on the left of the figure appear as dots on the graph based on their coordinates $\bm{\theta}$. We highlight two key subspaces. (a) The BTT subspace, characterized by no parameter sharing $\omega=0$, learning the maximum number of parameters per FLOP. (b) The full-rank BTT subspace where $\omega=0$ and $\psi=1$. In the experiments section we show that the full-rank BTT subspace contains the most performant structures across multiple tasks.
Figure 3: Compute-optimal frontier (highlighted points) of various Einsums follows power law scaling. As a result, Einsums can be scaled to reach arbitrarily low reducible loss, each with a different rate that can be estimated from small-scale experiments.
Figure 4: The taxonomy parameters $(\omega, \psi)$ explain differences in the scaling laws. (Left): parameter sharing ($\omega > 0$) leads to worse scaling. (Middle): among structures without parameter sharing ($\omega = 0$), full-rank structures ($\psi=1$) scale better than low-rank structures ($\psi<1$). (Right): in the $(\omega = 0, \psi = 1)$ subspace, various structures have nearly indistinguishable scaling laws compared to dense matrices, suggesting that not implementing parameter sharing and being full-rank are the necessary and sufficient conditions for a compute-efficient linear layer for GPT-2.
Figure 5: Our findings about the effect of $(\omega, \psi, \nu)$ on the scaling laws generalize to other settings. (Top row) Transformers trained with cross-entropy for autoregressive pixel generation on CIFAR-5M. (Bottom row) MLP trained with mean-squared-error loss on synthetic data generated by a large and randomly initialized MLP.
...and 5 more figures
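
The caption of Figure 3 says the compute-optimal frontier follows a power law whose rate can be estimated from small-scale experiments. A minimal sketch of such an estimate, assuming the reducible loss has the form $L(C) = a\,C^{-b}$ and fitting a line in log-log space, is shown below. The function name `fit_power_law` and all numbers are hypothetical: the data is synthetic and noise-free, not taken from the paper.

```python
import numpy as np


def fit_power_law(compute, reducible_loss):
    """Fit log L = log a - b log C by least squares; return (a, b)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(reducible_loss), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic frontier points (illustrative values only, not results from the paper).
compute = np.logspace(15, 18, 8)        # training FLOPs
a_true, b_true = 50.0, 0.1              # assumed prefactor and scaling exponent
loss = a_true * compute ** (-b_true)    # reducible loss on the frontier

a_hat, b_hat = fit_power_law(compute, loss)
assert np.isclose(b_hat, b_true)
assert np.isclose(a_hat, a_true, rtol=1e-6)
```

Under this assumed form, a larger fitted exponent `b` means the loss falls faster with compute, which is one way to compare two structures. In practice the irreducible loss would have to be estimated and subtracted first, a step this sketch omits.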
