Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul

TL;DR

MoNE addresses the inefficiency of uniform token processing in Vision Transformers by introducing a nested-expert architecture whose capacity increases across a spectrum of compute levels. A learnable router, combined with Expert Preferred Routing (EPR) and a capacity-distribution optimization, dynamically assigns tokens to nested models under a target effective capacity $e_c$, preserving the baseline parameter count. Empirical results on ImageNet-21K, Kinetics-400, and Something-Something-v2 show MoNE achieves comparable accuracy with about $2\times$ to $3\times$ reductions in inference-time FLOPs, especially benefiting from redundancy in video data. The approach also supports budget-adaptive inference with a single trained model, and visualizations corroborate that important tokens are routed to larger, more capable experts.

Abstract

The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining the same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, and thus redundant tokens are processed through cheaper nested experts. Using this framework, we achieve performance equivalent to the baseline models while reducing inference-time compute by over two-fold. We validate our approach on standard image and video datasets: ImageNet-21K, Kinetics-400, and Something-Something-v2. We further highlight MoNE's adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.
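The priority-order token selection described in the abstract can be illustrated with a small sketch. This is not the authors' Expert Preferred Routing implementation: the function name `route_by_priority`, the score layout, and the per-expert capacity fractions are assumptions made for illustration. Experts are filled from largest to smallest, each taking its highest-scoring remaining tokens up to its capacity, and whatever is left falls through to the cheapest expert.

```python
import math
from typing import List, Sequence


def route_by_priority(
    scores: Sequence[Sequence[float]],
    capacities: Sequence[float],
) -> List[int]:
    """Assign each token to one nested expert (hypothetical helper).

    scores[t][e]  : router score of token t for expert e
                    (expert 0 = smallest, last expert = full model).
    capacities[e] : assumed fraction of tokens expert e may process.
    Returns the expert index chosen for every token.
    """
    num_tokens = len(scores)
    num_experts = len(capacities)
    assignment = [-1] * num_tokens
    unassigned = set(range(num_tokens))

    # Larger experts pick first, so high-priority tokens get more compute.
    for e in range(num_experts - 1, 0, -1):
        # Floor of the token budget; the epsilon guards against float error.
        budget = math.floor(capacities[e] * num_tokens + 1e-9)
        k = min(len(unassigned), budget)
        # Highest score first; ties are broken by token index.
        chosen = sorted(unassigned, key=lambda t: (-scores[t][e], t))[:k]
        for t in chosen:
            assignment[t] = e
        unassigned.difference_update(chosen)

    # Remaining (presumably redundant) tokens go to the cheapest expert.
    for t in unassigned:
        assignment[t] = 0
    return assignment


if __name__ == "__main__":
    toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6], [0.7, 0.3]]
    print(route_by_priority(toy_scores, capacities=[0.5, 0.5]))
```

In this toy run with two experts and equal capacities, the two tokens with the highest score for the larger expert are sent there and the other two are processed by the smaller one.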

Paper Structure (15 sections, 7 equations, 8 figures, 1 table, 1 algorithm)

Figures (8)

  • Figure 1: MoNE's learned token importance: From left to right, fewer image tokens are processed using the full model -- to fit a compute budget -- by an increasing threshold on MoNE's router logits.
  • Figure 2: (a) Nested model: Partial in- and out-projections in the SA and MLP layers create nested models. $m$ controls the parameter count and the FLOPs of nested models. The self-attention information exchange happens at the full model dimension $D$; the MLP dimension is set to $4D$ as in ViT. (b) Mixture of Nested Experts (MoNE): Each token $\mathbf{x}$ is routed to a nested network, denoted by a different model dimension in the diagram. Here $\mathbf{x}_i$ is routed to a nested model with model dimension $D/4$, whereas $\mathbf{x}_{i+1}$ is routed to the full model. The information exchange between these tokens of different dimensions happens in the self-attention block, where they are always projected to the same dimension. The router weights are also multiplied with the features for proper flow of gradients. A lighter color in the weight matrix indicates a sliced matrix used to construct the nestedness.
  • Figure 3: Image classification: Performance comparison of MoNE with baselines on ImageNet-21K for different model sizes. MoNE performs significantly better than MatViT and Mixture-of-Depth (MoD) and even benefits from isoFLOPs training (see panel a).
  • Figure 4: Video classification: MoNE vs. baselines on video datasets. Finetuning with the isoFLOPs training regime matches the baseline with a $>2\times$ FLOP improvement.
  • Figure 5: Capacity adaptation during inference: Performance changes when a model trained at a certain capacity (denoted as $\bigstar$) is evaluated at other capacities. The "Train Adaptive" plot for SSv2 denotes a single model evaluated at different inference-time budgets.
  • ...and 3 more figures
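
The partial in- and out-projections from the Figure 2(a) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function name `nested_mlp` is hypothetical, plain Python lists stand in for tensors, biases are omitted, and ReLU replaces the activation used in ViT to keep the arithmetic easy to follow. A token that keeps only its first $m$ features uses the first $m$ rows of the in-projection and the first $m$ columns of the out-projection, so every nested model shares the same weight matrices.

```python
from typing import List, Sequence


def nested_mlp(
    x: Sequence[float],
    w_in: Sequence[Sequence[float]],
    w_out: Sequence[Sequence[float]],
) -> List[float]:
    """MLP block applied to a token of nested dimension m = len(x).

    w_in  : full in-projection, shape (D, H).
    w_out : full out-projection, shape (H, D).
    Only a slice of each matrix is touched when m < D.
    """
    m = len(x)
    hidden_dim = len(w_in[0])

    # Partial in-projection: use only the first m rows of w_in.
    hidden = [
        sum(x[i] * w_in[i][j] for i in range(m)) for j in range(hidden_dim)
    ]
    # Assumed activation (ReLU) for simplicity.
    hidden = [max(0.0, h) for h in hidden]

    # Partial out-projection: produce only the first m output features.
    return [
        sum(hidden[j] * w_out[j][k] for j in range(hidden_dim))
        for k in range(m)
    ]


if __name__ == "__main__":
    w_in = [[1.0, 0.0], [0.0, 1.0]]   # D = 2, H = 2 (toy sizes)
    w_out = [[1.0, 2.0], [3.0, 4.0]]  # H = 2, D = 2
    print(nested_mlp([2.0], w_in, w_out))        # token at nested dimension 1
    print(nested_mlp([2.0, -1.0], w_in, w_out))  # token at full dimension 2
```

The multiply-accumulate count of each projection scales with $m$ here, which matches the caption's statement that $m$ controls the FLOPs of the nested models.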