ButterflyViT: 354$\times$ Expert Compression for Edge Vision Transformers

Aryan Karmore

TL;DR

ButterflyViT treats experts not as independent weight matrices but as geometric reorientations of a single shared quantized substrate. This lets multiple experts fit on edge-constrained devices and shows that geometric parameterization breaks linear memory scaling.

Abstract

Deploying sparse Mixture-of-Experts (MoE) Vision Transformers remains a challenge because expert memory scales linearly: storing $N_E$ independent expert weight matrices requires $\mathcal{O}(N_E \cdot d^2)$ memory, which exceeds the memory budget of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyViT, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned rotations to a shared ternary prototype, the full set of experts requires $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$ memory, which is sub-linear in the number of experts. To address the unique challenges of vision, we introduce a spatial smoothness regulariser that penalises routing irregularities between adjacent patch tokens, turning patch correlation into a training signal. On image classification with CIFAR-100, ButterflyViT achieves a 354$\times$ memory reduction at 64 experts with negligible accuracy loss. ButterflyViT thus allows multiple experts to fit on edge-constrained devices, showing that geometric parameterization breaks linear scaling.
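The idea of instantiating experts as rotations of one shared ternary matrix can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the function names, the use of Givens rotations arranged in a butterfly pattern, rotating only the input, and the toy dimensions are not taken from the paper and may differ from the authors' parameterization.

```python
import math
import random


def butterfly_rotate(x, angles):
    """Apply butterfly layers of 2x2 Givens rotations to the vector x.

    angles[l][p] is the angle for pair p in layer l (len(x) // 2 pairs per layer).
    Assumes len(x) is a power of two and len(angles) <= log2(len(x)).
    """
    d = len(x)
    if d & (d - 1) or (1 << len(angles)) > d:
        raise ValueError("d must be a power of two with at most log2(d) layers")
    y = list(x)
    for layer, layer_angles in enumerate(angles):
        stride = 1 << layer
        pair = 0
        for i in range(d):
            if i & stride:  # i is the partner of an index already processed
                continue
            j = i | stride
            c, s = math.cos(layer_angles[pair]), math.sin(layer_angles[pair])
            y[i], y[j] = c * y[i] - s * y[j], s * y[i] + c * y[j]
            pair += 1
    return y


def expert_forward(w_base, angles, x):
    """Hypothetical expert: rotate the input, then apply the shared ternary base."""
    rotated = butterfly_rotate(x, angles)
    return [sum(w * v for w, v in zip(row, rotated)) for row in w_base]


random.seed(0)
d, n_layers, n_experts = 8, 3, 4

# One shared ternary matrix with entries in {-1, 0, +1}, stored once for all experts.
w_base = [[random.choice((-1, 0, 1)) for _ in range(d)] for _ in range(d)]

# Each expert stores only n_layers * d/2 angles instead of a full d x d matrix.
experts = [
    [[random.uniform(-math.pi, math.pi) for _ in range(d // 2)] for _ in range(n_layers)]
    for _ in range(n_experts)
]

x = [random.gauss(0.0, 1.0) for _ in range(d)]
outputs = [expert_forward(w_base, angles, x) for angles in experts]
print(len(outputs), "expert outputs of length", len(outputs[0]))
```

In this sketch each expert adds only `n_layers * d // 2` scalars on top of the shared base, which is where the $N_E \cdot n_\ell \cdot d$ term in the memory expression comes from. Because every layer is a product of disjoint plane rotations, the overall transform is orthogonal and preserves the norm of its input.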

Paper Structure (31 sections, 2 theorems, 14 equations, 7 figures, 2 tables, 2 algorithms)


Key Result

Proposition 1

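For $N_E$ experts with dimensions $d_{\text{model}}$ and $d_{\text{ff}}$, and $n_\ell$ butterfly layers, ButterflyViT expert memory is, in the asymptotic form given in the abstract:

$$\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$$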
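The scaling stated in the abstract, $\mathcal{O}(N_E \cdot d^2)$ for independent experts versus $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$ for ButterflyViT, can be made concrete with a rough bit count. The sketch below is illustrative only: the function names, the single $d_{\text{model}} \times d_{\text{ff}}$ matrix per expert, the 2-bit encoding of ternary weights, the 32-bit angles, the choice of $d_{\text{model}}/2$ angles per butterfly layer, and the example dimensions are all assumptions. It is not expected to reproduce the paper's 354$\times$ figure.

```python
def standard_moe_bits(n_experts, d_model, d_ff, bits_per_weight=32):
    """Bits needed when every expert stores its own d_model x d_ff matrix."""
    return n_experts * d_model * d_ff * bits_per_weight


def shared_base_bits(n_experts, d_model, d_ff, n_layers, base_bits=2, angle_bits=32):
    """Bits for one shared ternary base plus per-expert butterfly angles.

    Assumes d_model // 2 rotation angles per butterfly layer (hypothetical layout).
    """
    base = d_model * d_ff * base_bits
    angles = n_experts * n_layers * (d_model // 2) * angle_bits
    return base + angles


# Hypothetical dimensions chosen only to show the trend.
d_model, d_ff, n_layers = 256, 1024, 8
for n_experts in (8, 16, 64):
    dense = standard_moe_bits(n_experts, d_model, d_ff)
    shared = shared_base_bits(n_experts, d_model, d_ff, n_layers)
    print(f"N_E={n_experts:3d}  dense={dense} bits  shared={shared} bits  "
          f"ratio={dense / shared:.1f}x")
```

Under these assumptions the cost of adding an expert drops from $d_{\text{model}} \cdot d_{\text{ff}}$ full-precision weights to $n_\ell \cdot d_{\text{model}}/2$ angles, so the compression ratio grows as experts are added and the fixed cost of the shared base is amortised.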

Figures (7)

  • Figure 1: Standard ViT-MoE architecture
  • Figure 2: Top-$k$ gating instantiates experts via lightweight rotations of a shared ternary base matrix $\mathbf{W}_{\text{base}}$.
  • Figure 3: Training Curves and Validation loss visualised.
  • Figure 4: (Left) Number of experts is compared to the MoE expert memory. (Right) Expert Parameter Compression Ratio
  • Figure 5: Cosine Similarity showing the similarity score across 8 experts.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Proposition 1: Memory Scaling
  • Proposition 2: Compression Lower Bound