DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging

Neha Verma, Kenton Murray, Kevin Duh

TL;DR

Empirical results show that compared to simple or state-of-the-art neuron width-pruning techniques, DOTResize serves as a useful add-on to pruning, while achieving measurable reductions in real-world computational cost.

Abstract

Structured pruning methods designed for Large Language Models (LLMs) generally focus on identifying and removing the least important components to optimize model size. However, in this work, we question this prevalent approach by instead exploring how to recombine information from structures designated for pruning back into the reduced model. We specifically focus on neuron width reduction, which we frame as a Discrete Optimal Transport problem, and propose DOTResize, a novel Transformer compression method that uses optimal transport theory to transform and compress model width. To ensure applicability within the Transformer architecture, we motivate and incorporate necessary entropic regularization and matrix factorization techniques into the transportation maps produced by our method. Unlike pruning-based approaches, which discard neurons based on importance measures, DOTResize re-projects the entire neuron width, allowing the retention and redistribution of useful signal across the reduced layer. Empirical results show that compared to simple or state-of-the-art neuron width-pruning techniques, DOTResize serves as a useful add-on to pruning, while achieving measurable reductions in real-world computational cost.

Paper Structure

This paper contains 23 sections, 1 theorem, 13 equations, 6 figures, 2 tables.

Key Result

Theorem 2.1

Let $T \in \mathbb{R}^{d_{\text{orig}} \times d_{\text{new}}}$ be QR-decomposed into $T = QR$, where semi-orthogonal $Q \in \mathbb{R}^{d_{\text{orig}} \times d_{\text{new}}}$ has orthonormal columns ($Q^T Q = I_{d_{\text{new}}}$) and $R \in \mathbb{R}^{d_{\text{new}} \times d_{\text{new}}}$ is uppe

Figures (6)

  • Figure 1: A depiction of our neuron width merging strategy. In panel 1, we demonstrate computing activations from layer $K$ in preparation for panel 2, where we select a subset of 3 neurons from this layer, and compute pairwise similarities between the activations of the 5 original neurons, and the activations of the subset. In panel 3, we compute the optimal transport map, depicted in green, by optimizing the map according to the similarities and entropic regularization. Finally, we demonstrate replacing layer $K$ with the subset of neurons, after transforming its weights with $T$ and layer $K+1$'s weights with $T^{\text{inv}}$, resulting in new activations.
  • Figure 2: Our QR-decomposition step allows general invertible matrices to be applied at residual junctions. The figure depicts a general residual connection as $\bigoplus$ in a pre-norm Transformer layer. While orthogonal transformations are naturally invariant to RMSNorm, general invertible matrices are not unless decomposed via QR decomposition and re-routed as shown. The associativity of matrix multiplication allows us to absorb the matrix $R$ into the $T^{-1}$ calculation, so that the remaining orthogonal factor leaves RMSNorm unchanged.
  • Figure 3: Sparsity vs. performance tradeoff across different levels of compression on Llama-3.1-70B. Sparsity (%) indicates how much width is removed from weight matrices. Beyond 20% sparsity, real-world compute cost reductions can be observed.
  • Figure 4: Performance on Wikitext-2 at 20% and 30% width reduction with different entropic regularization $\lambda$. Our method is not sensitive to the amount of regularization for these models.
  • Figure 5: Performance at 20% sparsity with varying numbers of exemplar tokens. After approximately 130K tokens, the returns from using more calibration data appear to diminish.
  • ...and 1 more figure
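
The pipeline described in the Figure 1 caption (activations, a kept subset, pairwise similarities, an entropy-regularized transport map) can be sketched roughly as below. This is a minimal NumPy illustration, not the authors' implementation. The Euclidean cost, the uniform marginals, the random activations, and the names `sinkhorn_transport`, `reg`, `acts`, and `kept` are all assumptions made for the example; `reg` stands in for the entropic regularization strength $\lambda$.

```python
import numpy as np

def sinkhorn_transport(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    Assumes uniform mass on both sides (an illustrative choice).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass on each original neuron
    b = np.full(m, 1.0 / m)      # mass on each kept neuron
    K = np.exp(-cost / reg)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 5))   # hypothetical activations: tokens x 5 neurons
kept = [0, 2, 4]                   # hypothetical subset of 3 neurons
sub = acts[:, kept]                # activations of the kept subset

# Pairwise Euclidean distance between original and kept neurons, shape (5, 3).
diff = acts[:, :, None] - sub[:, None, :]
cost = np.sqrt((diff ** 2).sum(axis=0))
cost = cost / cost.max()           # normalize so reg has a stable scale

T = sinkhorn_transport(cost, reg=0.1)
print(T.shape)
```

Each row of `T` spreads one original neuron's mass over the kept neurons, so a discarded neuron contributes to several retained ones instead of being dropped. A smaller `reg` pushes the map toward a sparser, more assignment-like solution.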

Theorems & Definitions (2)

  • Theorem 2.1: QR invariance for rectangular maps
  • proof
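
The reduced QR factorization that Theorem 2.1 starts from can be checked numerically with a small sketch. This example only verifies the factorization properties named in the theorem's setup and the norm preservation that follows from orthonormal columns; it does not reproduce the theorem's full statement or proof. The matrix `T` here is a random stand-in for a transport map, and the sizes `d_orig = 5` and `d_new = 3` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_orig, d_new = 5, 3
T = rng.random((d_orig, d_new))           # hypothetical stand-in for a transport map

# Reduced QR: Q is (d_orig, d_new), R is (d_new, d_new).
Q, R = np.linalg.qr(T, mode="reduced")

print(np.allclose(Q.T @ Q, np.eye(d_new)))   # orthonormal columns
print(np.allclose(R, np.triu(R)))            # R is upper triangular
print(np.allclose(Q @ R, T))                 # factorization reproduces T

# Orthonormal columns preserve the Euclidean norm of any vector they map.
y = rng.normal(size=d_new)
print(np.isclose(np.linalg.norm(Q @ y), np.linalg.norm(y)))
```

Because $Q$ leaves vector norms unchanged, the scaling behaviour of $T$ is carried entirely by $R$, which is the factor the Figure 2 caption describes absorbing into the $T^{-1}$ calculation.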