ButterflyViT: 354$\times$ Expert Compression for Edge Vision Transformers

Aryan Karmore

TL;DR

ButterflyViT treats experts not as independent weight matrices but as geometric reorientations of a single shared quantized substrate. This lets many experts fit on edge-constrained devices and shows that geometric parameterization breaks linear memory scaling.

Abstract

Deploying sparse Mixture-of-Experts (MoE) Vision Transformers remains a challenge because expert memory scales linearly. A standard MoE layer stores $N_E$ independent expert weight matrices, requiring $\mathcal{O}(N_E \cdot d^2)$ memory, which exceeds the memory budget of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyViT, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing the shared capacity from different angles, not from redundant storage. By applying learned rotations, built from $n_\ell$ butterfly layers, to a shared ternary prototype, the expert layer requires $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$ memory, which is sub-linear in the number of experts. To address the unique challenges of vision, we introduce a spatial smoothness regulariser that penalises routing irregularities between adjacent patch tokens, turning patch correlation into a training signal. On CIFAR-100 image classification, ButterflyViT achieves a 354$\times$ memory reduction at 64 experts with negligible accuracy loss. ButterflyViT allows many experts to fit on edge-constrained devices, showing that geometric parameterization breaks linear scaling.
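The idea of deriving each expert from one shared ternary matrix plus a cheap learned rotation can be illustrated with a minimal sketch. It is not the paper's implementation: the absmean-style ternarisation rule, the Givens-rotation form of the butterfly layers, the choice to rotate only the input side, and all names (`ternarize`, `butterfly_rotate`, `expert_forward`) are assumptions made for illustration.

```python
import math
import random


def ternarize(weights):
    """Quantize a real matrix to {-1, 0, +1} with one shared scale.

    Assumption: an absmean-style rule; the paper's exact quantizer may differ.
    """
    n = len(weights) * len(weights[0])
    scale = sum(abs(w) for row in weights for w in row) / n or 1.0
    tern = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return tern, scale


def butterfly_rotate(x, angles):
    """Apply butterfly layers of 2x2 rotations to x (length d, a power of two).

    angles[l] holds d/2 angles for layer l, which pairs index i with i XOR 2**l.
    Each layer is orthogonal, so the product is a rotation of the input.
    """
    y = list(x)
    for layer, thetas in enumerate(angles):
        stride = 1 << layer
        out = list(y)
        k = 0
        for i in range(len(y)):
            j = i ^ stride
            if i < j:  # visit each pair once
                c, s = math.cos(thetas[k]), math.sin(thetas[k])
                out[i] = c * y[i] - s * y[j]
                out[j] = s * y[i] + c * y[j]
                k += 1
        y = out
    return y


def expert_forward(x, tern, scale, angles):
    """One expert: rotate the input, then multiply by the shared ternary base."""
    z = butterfly_rotate(x, angles)
    return [scale * sum(t * zi for t, zi in zip(row, z)) for row in tern]


# Hypothetical toy sizes, chosen only to keep the example small.
random.seed(0)
d_model, d_ff, n_experts = 8, 16, 4
n_layers = d_model.bit_length() - 1  # log2(d_model) butterfly layers

base = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
tern, scale = ternarize(base)  # stored once, shared by every expert

# Per-expert storage is only n_layers * d_model / 2 rotation angles.
expert_angles = [
    [[random.uniform(-math.pi, math.pi) for _ in range(d_model // 2)]
     for _ in range(n_layers)]
    for _ in range(n_experts)
]

x = [random.gauss(0, 1) for _ in range(d_model)]
outputs = [expert_forward(x, tern, scale, a) for a in expert_angles]
```

In this sketch the experts differ only in their angle tables, so adding an expert adds `n_layers * d_model // 2` numbers rather than a new `d_ff` by `d_model` matrix.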

Paper Structure (31 sections, 2 theorems, 14 equations, 7 figures, 2 tables, 2 algorithms)

Introduction
Literature Review
ViT architectures
Model Compression and Quantization
Sparse MoE for ViT
ButterflyViT
Methodology
Problem Setup
Why Standard MoE fails on Edge Devices
Core Insight: Experts as Orbits of a Quantized Prototype
Parameterization via Butterfly Matrices
Complexity
Quantized Substrate and Outlier Suppression
Ternary Quantization
Straight-Through Estimator (STE)
...and 16 more sections

Key Result

Proposition 1

For $N_E$ experts with dimensions $d_{\text{model}}$ and $d_{\text{ff}}$ and $n_\ell$ butterfly layers, ButterflyViT expert memory scales as $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$.
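To make the scaling in Proposition 1 concrete, the following sketch counts stored parameters for a standard MoE layer and for a shared-base layer with per-expert butterfly angles. The constant factors are assumptions (one projection matrix per expert, $d/2$ angles per butterfly layer), the dimensions are hypothetical, and bit widths are ignored, so the ratio it produces is illustrative and is not meant to reproduce the 354$\times$ figure reported in the paper.

```python
def standard_moe_params(n_experts, d_model, d_ff):
    """Each expert stores its own d_model x d_ff matrix: linear in n_experts."""
    return n_experts * d_model * d_ff


def butterfly_params(n_experts, d_model, d_ff, n_layers):
    """One shared base matrix plus n_layers * d_model / 2 angles per expert.

    Assumption: d_model / 2 rotation angles per butterfly layer.
    """
    shared = d_model * d_ff
    per_expert = n_layers * (d_model // 2)
    return shared + n_experts * per_expert


# Hypothetical dimensions, not taken from the paper.
d_model, d_ff, n_layers = 256, 1024, 8

for n_experts in (8, 16, 32, 64):
    dense = standard_moe_params(n_experts, d_model, d_ff)
    shared = butterfly_params(n_experts, d_model, d_ff, n_layers)
    print(n_experts, dense, shared, round(dense / shared, 1))
```

Under these assumptions the shared base dominates the total, so the parameter ratio keeps growing as experts are added.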

Figures (7)

Figure 1: Standard ViT-MoE architecture.
Figure 2: Top-$k$ gating instantiates experts via lightweight rotations of a shared ternary base matrix $\mathbf{W}_{\text{base}}$.
Figure 3: Training curves and validation loss.
Figure 4: (Left) MoE expert memory versus number of experts. (Right) Expert parameter compression ratio.
Figure 5: Pairwise cosine similarity across 8 experts.
...and 2 more figures

Theorems & Definitions (2)

Proposition 1: Memory Scaling
Proposition 2: Compression Lower Bound
