
Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization

James Oldfield, Markos Georgopoulos, Grigorios G. Chrysos, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Jiankang Deng, Ioannis Patras

TL;DR

This paper proposes the Multilinear Mixture of Experts ($\mu$MoE) layer, focusing on vision models. It presents qualitative and quantitative evidence that scaling $\mu$MoE layers when fine-tuning foundation models for vision tasks yields more specialized experts at the class level, which in turn enables manual bias correction in CelebA attribute classification.

Abstract

The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models. $\mu$MoE layers enable scalable expert specialization by performing an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, $\mu$MoEs (1) avoid the restrictively high inference-time costs of dense MoEs, yet (2) do not inherit the training issues of the popular sparse MoEs' discrete (non-differentiable) expert routing. We present both qualitative and quantitative evidence that scaling $\mu$MoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class level, further enabling manual bias correction in CelebA attribute classification. Finally, we show qualitative results demonstrating the expert specialism achieved when pre-training large GPT-2 and MLP-Mixer models with parameter-matched $\mu$MoE blocks at every layer, maintaining comparable accuracy. Our code is available at: https://github.com/james-oldfield/muMoE.
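The idea of computing with a large expert weight tensor "entirely in factorized form" can be illustrated with a minimal NumPy sketch. It assumes a rank-`R` CP factorization of an `N x I x O` expert tensor and a softmax gate; the function names, shapes, and the choice of softmax are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(v - v.max())
    return e / e.sum()

def cp_mumoe_forward(z, G, A, B, C):
    """Factorized forward pass: never builds the N x I x O tensor.

    z: (I,) input, G: (I, N) gating weights (assumed form),
    A: (N, R), B: (I, R), C: (O, R) hypothetical CP factor matrices.
    """
    a = softmax(G.T @ z)                 # (N,) expert coefficients
    return C @ ((A.T @ a) * (B.T @ z))   # (O,) output

def dense_mumoe_forward(z, G, A, B, C):
    """Reference: materialize the full tensor, then contract it."""
    W = np.einsum("nr,ir,or->nio", A, B, C)   # (N, I, O) expert weights
    a = softmax(G.T @ z)
    return np.einsum("nio,n,i->o", W, a, z)

rng = np.random.default_rng(0)
N, I, O, R = 8, 5, 4, 3                  # experts, input dim, output dim, CP rank
G = rng.normal(size=(I, N))
A = rng.normal(size=(N, R))
B = rng.normal(size=(I, R))
C = rng.normal(size=(O, R))
z = rng.normal(size=I)

y_fact = cp_mumoe_forward(z, G, A, B, C)
y_dense = dense_mumoe_forward(z, G, A, B, C)

dense_params = N * I * O                 # grows with the product of the dimensions
cp_params = R * (N + I + O)              # grows with their sum
```

Both functions produce the same output, but the factorized version stores `R * (N + I + O)` weights instead of `N * I * O`, which is why the expert count `N` can be increased without materializing the full tensor.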

Paper Structure (69 sections, 17 equations, 28 figures, 10 tables)


Figures (28)

  • Figure 1: Benefits of the proposed $\mu$MoEs' model form over existing MoEs.
  • Figure 1: The forward pass of an (unfactorized) $\mu$MoE layer as a series of tensor contractions: the experts' weight matrices (yellow $2$D slices) are matrix-multiplied with the input vector and summed (weighted by the red expert coefficients).
  • Figure 2: Specialization in $256$ vs $32$ total expert CP$\mu$MoE layers (fine-tuned on CLIP ViT-B-32). Each row displays randomly selected images processed (with coefficient $\geq0.5$) by the first few experts for the two models. The more we scale the expert count, the greater the apparent expert specialism (to single visual themes or image categories).
  • Figure 3: Higher expert counts lead to more monosemantic experts: mean expert class-level polysemanticity of Eq. \ref{eq:polysemanticity} ($\downarrow$) as a function of the total number of experts. Results are shown for both CLIP ViT-B-32 and DINO models fine-tuned on ImageNet1k with CP$\mu$MoE layers.
  • Figure 4: Top-activating patches (top rows) and their full images (second rows) for the first 3 experts across 2 CP$\mu$MoE-e64 layers in $\mu$MoE MLP-Mixer (Tolstikhin et al., 2021) models. $\mu$MoE blocks exhibit coarse-grained specialism (e.g. texture) earlier and more fine-grained specialism (e.g. objects) deeper in the network.
  • ...and 23 more figures
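
The selection rule in the Figure 2 caption (images processed by an expert with coefficient $\geq 0.5$) can be sketched as follows. The coefficient matrix here is hand-written toy data and `images_per_expert` is a hypothetical helper, so this shows only the thresholding step, not the paper's evaluation code.

```python
import numpy as np

def images_per_expert(coeffs, threshold=0.5):
    """Map each expert index to the inputs whose coefficient meets the threshold.

    coeffs: (num_images, num_experts) array of expert coefficients,
    assumed here to be non-negative with each row summing to 1.
    """
    return {
        n: np.flatnonzero(coeffs[:, n] >= threshold).tolist()
        for n in range(coeffs.shape[1])
    }

# Toy coefficients for 4 images and 3 experts (illustrative values only).
coeffs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.3, 0.2],
    [0.4, 0.3, 0.3],
])

groups = images_per_expert(coeffs)
```

Under the row-sum assumption, an image can exceed 0.5 for at most one expert, so the threshold picks out inputs that a single expert dominates. Images with no dominant expert, like the last row, appear in no group.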