$μ$-MoE: Test-Time Pruning as Micro-Grained Mixture-of-Experts

Toshiaki Koike-Akino, Jing Liu, Ye Wang

TL;DR

The paper tackles the high computational cost of large foundation models by introducing μ-MoE, a test-time, activation-aware pruning framework that treats each individual weight as a micro-expert and selects which of them to activate on a per-prompt basis. By running Wanda pruning online, it achieves fine-grained, prompt-dependent sparsity with limited overhead and avoids the domain shift that offline calibration can suffer on unseen tasks. On OPT perplexity benchmarks (WT2, PTB, and C4) and multimodal tasks with LLaVA-7B, μ-MoE is reported to outperform offline pruning methods, especially at moderate active-weight ratios. The work presents a practical path toward per-prompt efficient inference for large language models and multimodal systems.

Abstract

To tackle the huge computational demand of large foundation models, activation-aware compression techniques without retraining have been introduced. However, since these rely on calibration data, domain shift may arise for unknown downstream tasks. With computationally efficient calibration, activation-aware pruning can be executed adaptively for every prompt while still reducing complexity at inference. We formulate it as a mixture of micro-experts, called $μ$-MoE. Several experiments demonstrate that $μ$-MoE can dynamically adapt to task/prompt-dependent structured sparsity on the fly.
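
The per-prompt pruning idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the standard Wanda score $|W_{ij}| \cdot \lVert X_j \rVert_2$ compared within each output row, and it assumes that `rho` is the fraction of weights kept per row. The function names `wanda_mask` and `mu_moe_linear` are hypothetical and chosen for illustration only.

```python
import numpy as np

def wanda_mask(W, X, rho):
    """Boolean mask keeping the top `rho` fraction of weights in each output row.

    W: (d_out, d_in) weight matrix.
    X: (n_tokens, d_in) activations of the current prompt (the online "calibration" data).
    rho: assumed active-weight ratio in (0, 1].
    """
    d_out, d_in = W.shape
    k = max(1, int(round(rho * d_in)))       # number of weights kept per row
    feat_norm = np.linalg.norm(X, axis=0)    # (d_in,) L2 norm of each input feature over tokens
    score = np.abs(W) * feat_norm            # Wanda-style score |W_ij| * ||X_j||_2
    # Partial selection: indices of the k largest scores per row, without a full sort.
    keep = np.argpartition(score, d_in - k, axis=1)[:, d_in - k:]
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask

def mu_moe_linear(W, X, rho):
    """Linear layer that uses only the weights selected for this prompt."""
    mask = wanda_mask(W, X, rho)
    return X @ (W * mask).T

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(5, 8))                  # one prompt with 5 tokens
Y = mu_moe_linear(W, X, rho=0.5)             # half of each row's weights are active
```

Because the mask is recomputed from each prompt's own activations, two prompts can activate different weights in the same layer. The sketch multiplies by a dense masked matrix for clarity, so it does not by itself save any computation; a practical version would need sparse kernels.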

Paper Structure

This paper contains 30 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Coarse to micro-grained MoE.
  • Figure 2: Offline vs online pruning: dynamic pruning finds prompt-dependent sparse structure at test time, preventing domain shift.
  • Figure 3: Wanda pruning complexity based on torch.sort/topk/kthvalue on CPU and GPU at $\rho=0.25, 0.50, 0.75$.
  • Figure 4: Perplexity results averaged over WT2, PTB, and C4 datasets for compressed OPT models.

Theorems & Definitions (1)

  • Remark 2.1