$μ$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts
Toshiaki Koike-Akino, Jing Liu, Ye Wang
TL;DR
Large foundation models are expensive to run. This paper introduces μ-MoE, a test-time, activation-aware pruning framework that treats each individual weight as a micro-expert and selects, from this vast pool, which micro-experts to activate on a per-prompt basis. Because it runs Wanda pruning online for each prompt rather than once on offline calibration data, it obtains fine-grained, prompt-dependent sparsity with limited overhead and sidesteps the domain-shift problem inherent to offline calibration. On OPT perplexity benchmarks and on multimodal tasks with LLaVA-7B, μ-MoE consistently outperforms offline pruning methods, especially at moderate active-weight ratios, which indicates that the sparsity pattern adapts robustly to the task. The work offers a practical path toward scalable, per-prompt efficient inference for large language models and multimodal systems.
Abstract
To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these techniques rely on calibration data, domain shift may arise for unknown downstream tasks. With a computationally efficient calibration, activation-aware pruning can instead be executed adaptively for every prompt while still reducing complexity at inference. We formulate this as a mixture of micro-experts, called $μ$-MoE. Several experiments demonstrate that $μ$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
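To make the idea of per-prompt, activation-aware pruning concrete, the following is a minimal sketch rather than the authors' implementation. It uses the Wanda importance score, $|W_{ij}| \cdot \lVert X_j \rVert_2$, computed from the activations of the current prompt, and keeps the highest-scoring weights in each output row. The function names (`wanda_mask`, `sparse_linear`), the `active_ratio` parameter, the toy matrices, and the per-row selection are illustrative assumptions, not details taken from the paper.

```python
import math


def wanda_mask(weight, activations, active_ratio):
    """Build a 0/1 mask that keeps the top-scoring weights in each output row.

    weight:       list of rows, shape (d_out, d_in)
    activations:  token vectors of the current prompt, shape (n_tokens, d_in)
    active_ratio: fraction of weights kept per row (hypothetical parameter)
    """
    d_in = len(weight[0])
    # L2 norm of every input feature, taken over the tokens of this prompt.
    norms = [math.sqrt(sum(tok[j] ** 2 for tok in activations)) for j in range(d_in)]
    keep = max(1, round(active_ratio * d_in))
    mask = []
    for row in weight:
        # Wanda score: weight magnitude times input-feature norm.
        scores = [abs(w) * n for w, n in zip(row, norms)]
        kept = set(sorted(range(d_in), key=lambda j: scores[j], reverse=True)[:keep])
        mask.append([1 if j in kept else 0 for j in range(d_in)])
    return mask


def sparse_linear(weight, mask, x):
    """Apply the masked linear layer to one input vector x."""
    return [
        sum(w * m * xj for w, m, xj in zip(row, mrow, x))
        for row, mrow in zip(weight, mask)
    ]


# Toy layer with 2 outputs and 4 inputs (made-up numbers).
W = [[0.9, -0.1, 0.5, 0.2],
     [0.3, 0.8, -0.05, 0.6]]

# Two prompts whose activations excite different input features.
prompt_a = [[1.0, 0.0, 0.1, 0.0], [2.0, 0.1, 0.0, 0.1]]
prompt_b = [[0.0, 1.0, 3.0, 2.0], [0.1, 2.0, 1.0, 2.0]]

mask_a = wanda_mask(W, prompt_a, active_ratio=0.5)
mask_b = wanda_mask(W, prompt_b, active_ratio=0.5)
y_a = sparse_linear(W, mask_a, prompt_a[0])
```

With the same weights, the two prompts select different subsets of weights (`mask_a` and `mask_b` differ), which is the sense in which each weight behaves like a micro-expert routed by the prompt. A practical implementation would operate on tensors and would need to keep the scoring step cheap relative to the savings from sparsity.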
