Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Zixi Chen, Micah Goldblum, Bayan Bruss, Christopher De Sa, Andrew Gordon Wilson

TL;DR

This work proposes BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure, and finds BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.

Abstract

Dense linear layers are the dominant computational bottleneck in large neural networks, presenting a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. Namely, a small $\omega$ (which measures parameter sharing) and large $\psi$ (which measures the rank) reliably led to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to the standard sparse MoE for each entire feed-forward network, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find BTT-MoE provides a substantial compute-efficiency gain over dense layers and standard MoE.
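
As a minimal sketch of what "expressible via an Einstein summation" means in practice, the NumPy snippet below applies a low-rank layer and a Kronecker-structured layer as einsums and compares each against its dense equivalent. The shapes, variable names, and index letters are illustrative assumptions; this does not reproduce the paper's own parameterization of the Einsum space.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 6, 6
x = rng.standard_normal((batch, d_in))

# Low-rank layer: W = U @ V with rank r, applied without materializing W.
r = 2
U = rng.standard_normal((d_out, r))
V = rng.standard_normal((r, d_in))
y_low_rank = np.einsum("or,ri,bi->bo", U, V, x)

# Kronecker layer: W = kron(A, B). The input is reshaped to (batch, 2, 3)
# so that A acts on the first factor axis and B on the second.
A = rng.standard_normal((2, 2))
B = rng.standard_normal((3, 3))
X = x.reshape(batch, 2, 3)
y_kron = np.einsum("ac,bd,ncd->nab", A, B, X).reshape(batch, d_out)

# Dense references, built only to check the structured versions.
W_low_rank = U @ V
W_kron = np.kron(A, B)
print(np.allclose(y_low_rank, x @ W_low_rank.T))
print(np.allclose(y_kron, x @ W_kron.T))
```

Both structures differ only in the einsum string and the factor shapes, which is the property that lets a single framework cover them.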

Paper Structure (37 sections, 23 equations, 10 figures)

Figures (10)

  • Figure 1: We use Einsums to parameterize a wide range of structured matrices and search for the most efficient structure for compute-optimal training. Left: A diagrammatic representation of a general two-factor Einsum. We parameterize the space of Einsums through a real-valued vector $\bm{\theta}=(\theta_\alpha,\theta_\beta,\theta_\gamma,\theta_\delta,\theta_\epsilon,\theta_\phi,\theta_\rho) \in [0,1]^{7}$. This space captures many well-known structures through specific values of $\bm{\theta}$. Middle: Example of well-known structures with their $\bm{\theta}$ values. Any omitted line implies the value of the entry in the vector is 0. Right: Compute-optimal scaling laws of example structures for GPT-2 on OpenWebText when substituting its dense layers (see details in the experiments section).
  • Figure 2: Illustrating the Einsum taxonomy. The 3D graph represents relevant quantities of the Einsum structure such as the amount of parameter sharing $\omega$ (x-axis), its rank $\psi$ (y-axis), and its compute intensity $\nu$ (z-axis). The structures on the left of the figure appear as dots on the graph based on their coordinates $\bm{\theta}$. We highlight two key subspaces. (a) The BTT subspace, characterized by no parameter sharing ($\omega=0$), learning the maximum number of parameters per FLOP. (b) The full-rank BTT subspace where $\omega=0$ and $\psi=1$. In the experiments section we show that the full-rank BTT subspace contains the most performant structures across multiple tasks.
  • Figure 3: Compute-optimal frontier (highlighted points) of various Einsums follows power law scaling. As a result, Einsums can be scaled to reach arbitrarily low reducible loss, each with a different rate that can be estimated from small-scale experiments.
  • Figure 4: The taxonomy parameters $(\omega, \psi)$ explain differences in the scaling laws. (Left): parameter sharing ($\omega > 0$) leads to worse scaling. (Middle): among structures without parameter sharing ($\omega = 0$), full-rank structures ($\psi=1$) scale better than low-rank structures ($\psi<1$). (Right): in the $(\omega = 0, \psi = 1)$ subspace, various structures have nearly indistinguishable scaling laws compared to dense matrices, suggesting that not implementing parameter sharing and being full-rank are the necessary and sufficient conditions for a compute-efficient linear layer for GPT-2.
  • Figure 5: Our findings about the effect of $(\omega, \psi, \nu)$ on the scaling laws generalize to other settings. (Top row) Transformers trained with cross-entropy for autoregressive pixel generation on CIFAR-5M. (Bottom row) MLP trained with mean-squared-error loss on synthetic data generated by a large and randomly initialized MLP.
  • ...and 5 more figures
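
The "parameters per FLOP" idea in the Figure 2 caption can be illustrated with a rough count for square $d \times d$ layers. The sketch below is an assumption-laden illustration, not the paper's definition of $\omega$: the function names are hypothetical, cost is counted in multiply-accumulates (MACs) per input vector, Kronecker is taken as $A \otimes B$ with two $\sqrt{d} \times \sqrt{d}$ factors, and Monarch is taken as two block-diagonal factors with $\sqrt{d}$ blocks of size $\sqrt{d} \times \sqrt{d}$ (permutations cost no MACs).

```python
from math import isqrt


def dense(d):
    # One parameter is used by exactly one multiply-accumulate.
    return d * d, d * d


def low_rank(d, r):
    # Two factors of shape (d, r) and (r, d), applied one after the other.
    return 2 * d * r, 2 * d * r


def kronecker(d):
    # Two sqrt(d) x sqrt(d) factors; each entry is reused across sqrt(d) slices.
    m = isqrt(d)
    assert m * m == d, "d must be a perfect square"
    return 2 * m * m, 2 * m ** 3


def monarch(d):
    # Two block-diagonal factors, each with sqrt(d) blocks of size sqrt(d) x sqrt(d).
    m = isqrt(d)
    assert m * m == d, "d must be a perfect square"
    return 2 * m ** 3, 2 * m ** 3


def params_per_mac(counts):
    params, macs = counts
    return params / macs


d = 1024
ratios = {
    "dense": params_per_mac(dense(d)),
    "low_rank": params_per_mac(low_rank(d, r=32)),
    "kronecker": params_per_mac(kronecker(d)),
    "monarch": params_per_mac(monarch(d)),
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

Under these assumptions only the Kronecker structure falls below one parameter per MAC (1/32 at $d = 1024$), because each of its parameters is reused across many multiplications. That reuse is the kind of parameter sharing the captions associate with $\omega > 0$.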