Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

Uranik Berisha, Jens Mehnert, Alexandru Paul Condurache

TL;DR

This work tackles the high computational cost of Vision Transformers by extracting Mixture-of-Experts (MoE) from pretrained networks after training. It clusters activations per layer with HDBSCAN to identify activation patterns, selects neurons by variance to form data-driven expert subnetworks, and routes each token with a lightweight cosine-similarity comparison against the experts' mean input tokens. The resulting MoEE variants reduce MACs by up to ~36% and parameters by up to ~32% while retaining most of the original performance after minimal fine-tuning on ImageNet-1k, especially for larger models, and the method generalizes to other architectures such as Swin and ConvNeXt. The paper also analyzes activation modularity, routing distributions, and expert formation, offering a practical path to efficient ViT deployment without full retraining.

Abstract

Vision Transformers have emerged as the state-of-the-art models in various Computer Vision tasks, but their high computational and resource demands pose significant challenges. While Mixture-of-Experts (MoE) can make these models more efficient, they often require costly retraining or even training from scratch. Recent developments aim to reduce these computational costs by leveraging pretrained networks. Such networks have been shown to produce sparse activation patterns in the Multi-Layer Perceptrons (MLPs) of the encoder blocks, allowing for conditional activation of only relevant subnetworks for each sample. Building on this idea, we propose a new method to construct MoE variants from pretrained models. Our approach extracts expert subnetworks from the model's MLP layers post-training in two phases. First, we cluster output activations to identify distinct activation patterns. In the second phase, we use these clusters to extract the corresponding subnetworks responsible for producing them. On ImageNet-1k recognition tasks, we demonstrate that these extracted experts can perform surprisingly well out of the box and require only minimal fine-tuning to regain 98% of the original performance, all while reducing MACs and model size by up to 36% and 32%, respectively.
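The two phases and the routing step can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: cluster labels are taken as given (a random placeholder stands in for the HDBSCAN output), "variance-based selection" is read as keeping the hidden neurons whose activations vary most within a cluster, ReLU stands in for the model's activation function, and routing is hard top-1. The names `extract_experts`, `route`, `moe_forward`, and `keep` are hypothetical.

```python
import numpy as np

def extract_experts(x, w1, b1, w2, labels, keep):
    """Build one expert per cluster from a two-layer MLP.

    x: (n, d) MLP inputs, w1: (d, h), b1: (h,), w2: (h, d),
    labels: (n,) cluster ids, with -1 marking unclustered (noise) tokens.
    """
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU as a stand-in activation
    experts = []
    for c in sorted(set(labels.tolist()) - {-1}):
        mask = labels == c
        # Assumption: keep the `keep` neurons with the highest variance in this cluster.
        variance = hidden[mask].var(axis=0)
        idx = np.sort(np.argsort(variance)[-keep:])
        experts.append({
            "centroid": x[mask].mean(axis=0),  # mean input token of the cluster
            "w1": w1[:, idx], "b1": b1[idx], "w2": w2[idx, :],
        })
    return experts

def route(x, experts):
    """Assign each token to the expert with the most cosine-similar mean input."""
    cents = np.stack([e["centroid"] for e in experts])
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    cn = cents / (np.linalg.norm(cents, axis=1, keepdims=True) + 1e-12)
    return (xn @ cn.T).argmax(axis=1)

def moe_forward(x, experts, b2):
    """Run every token through its selected (smaller) expert MLP."""
    choice = route(x, experts)
    out = np.empty((x.shape[0], b2.shape[0]))
    for k, e in enumerate(experts):
        m = choice == k
        out[m] = np.maximum(x[m] @ e["w1"] + e["b1"], 0.0) @ e["w2"] + b2
    return out

# Toy usage with made-up sizes; real labels would come from a clustering step.
rng = np.random.default_rng(0)
d, h, n = 8, 32, 200
x = rng.normal(size=(n, d))
w1, b1 = rng.normal(size=(d, h)), np.zeros(h)
w2, b2 = rng.normal(size=(h, d)), np.zeros(d)
labels = rng.integers(0, 4, size=n)  # placeholder for HDBSCAN cluster labels
experts = extract_experts(x, w1, b1, w2, labels, keep=16)
y = moe_forward(x, experts, b2)
```

With `keep=16` each expert holds half of the 32 hidden neurons, so a token routed to it costs roughly half the MLP multiply-accumulates. With `keep=h` every expert is the full MLP, which gives a simple sanity check that the slicing is correct.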

Paper Structure

This paper contains 33 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Illustration of the expert extraction process. The process begins with the clustering of activations (a), followed by the extraction of experts from these clusters (b), and finishes with the inference stage, using the extracted experts (c).
  • Figure 2: Sorted routing distributions across experts for all classes at different layers, demonstrating a more balanced routing load than ViT-MoEfication and reducing the need for additional load-balancing terms.
  • Figure 3: Token routing distributions at different layers for randomly selected classes (goldfish, pug, plane, and cliff) compared to the distribution across all classes (all). Layer 6 shows a relatively even distribution across experts, indicating less class-specific specialization. Layer 11 shows distinct spikes in the routing, as tokens are routed more selectively based on class-specific features.
  • Figure 4: Token routing distributions at different layers for visually similar truck-like classes (fire engine, garbage truck, pickup, and tow truck) compared to the distribution across all classes (all). Both layers show similar distributions across experts for all truck-like classes. This indicates that visually similar classes are processed by similar expert selections, reflecting the effectiveness of the routing mechanism.
  • Figure 5: Similarity matrices of mean expert inputs at different layers, illustrating the relationships between inputs of different experts. Layer 6 shows high similarities, with several experts' inputs being closely related, indicating less specialization. Layer 11 exhibits distinct differences, with some experts showing near-orthogonal inputs, indicating higher specialization.
  • ...and 5 more figures