Dynamic Mixture-of-Experts for Visual Autoregressive Model

Jort Vincenti, Metod Jazbec, Guoxuan Xia

TL;DR

Visual Autoregressive Models (VAR) achieve high-quality image generation but incur substantial compute due to repeated transformer calls across scales. The paper introduces a dynamic Mixture-of-Experts (MoE) router with scale-aware thresholding (D2DMoE) that sparsifies Transformer feed-forward networks by activating fewer experts per token and per scale, without retraining. On ImageNet, the method reduces FLOPs by about 19% and speeds up inference by roughly 11% while maintaining FID within 1% of the dense VAR baseline, by concentrating MoE sparsification on the last three refinement scales. This demonstrates that exploiting redundancy across both spatial tokens and hierarchical scales can yield substantial efficiency gains for high-quality visual autoregressive generation, with potential for further gains via targeted fine-tuning and improved routing.

Abstract

Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture makes it possible to trade compute for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs and 11% faster inference while matching the image quality of the dense baseline.

Paper Structure

This paper contains 14 sections, 3 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Generation pipeline. Left: The coarse-to-fine decoder of VAR performing next-scale prediction: it takes $(\mathtt{[s]},r_1,r_2,\dots,r_{K-1})$ as input to predict $(\hat{r}_1,\hat{r}_2,\dots,\hat{r}_K)$. Right: The FFN block is replaced with a dynamic-$k$ gating MoE module. It executes expert $E_j$ only when $\|E_j\| \ge \tau_s \max_i \|E_i\|$, filtering the experts by $\ell_2$-norm. Because the thresholds grow with resolution ($\tau_{1}<\dots<\tau_{S}$), many experts are used at coarse scales while fewer are consulted at fine scales.
  • Figure 2: Quality–efficiency trade-off and expert routing behaviour. (a) Qualitative comparison of DMoE-VAR and VAR samples. (b) Expert routing and generation patterns, shown across scales. First and second rows: generated images from the VAR baseline and DMoE-VAR. Third row: heat-maps of total experts allocated per token across layers. Fourth row: bar plots of the average number of experts used at each scale.
  • Figure 3: FID (↓) vs GFLOP (↓). Different optimization methods utilized with VAR.
  • Figure 4: FID (↓) vs GFLOP (↓). MoEfication applied to a single scale.
  • Figure 5: FID (↓) vs GFLOP (↓). DMoE-VAR integrated into different model depths.
  • ...and 7 more figures
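
The gating rule stated in the Figure 1 caption, $\|E_j\| \ge \tau_s \max_i \|E_i\|$ with thresholds that grow across scales, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the toy norm values, and the linear threshold schedule are all assumptions made for illustration, and the per-expert norm values are taken as given (for example, as produced by a router), since this summary does not say how they are obtained.

```python
from typing import List, Sequence


def select_experts(norms: Sequence[float], tau: float) -> List[int]:
    """Return indices j with norms[j] >= tau * max_i norms[i].

    `norms` holds one l2-norm score per expert for a single token
    (assumed to be available before the experts are executed).
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    cutoff = tau * max(norms)
    return [j for j, n in enumerate(norms) if n >= cutoff]


def linear_thresholds(num_scales: int, tau_min: float, tau_max: float) -> List[float]:
    """Hypothetical schedule: thresholds rising linearly from tau_min to tau_max.

    The source only states tau_1 < ... < tau_S; the linear shape is an assumption.
    """
    if num_scales == 1:
        return [tau_max]
    step = (tau_max - tau_min) / (num_scales - 1)
    return [tau_min + s * step for s in range(num_scales)]


# Toy scores for one token and four experts (made-up values).
norms = [0.9, 0.5, 0.2, 0.05]

# Three scales, coarse to fine, with increasing thresholds.
taus = linear_thresholds(num_scales=3, tau_min=0.1, tau_max=0.6)

# Experts executed for this token at each scale.
experts_per_scale = [select_experts(norms, tau) for tau in taus]
for scale, (tau, chosen) in enumerate(zip(taus, experts_per_scale), start=1):
    print(f"scale {scale}: tau={tau:.2f}, experts={chosen}")
```

With these toy values the number of selected experts shrinks as the threshold rises, which mirrors the caption's statement that fewer experts are consulted at fine scales. The expert with the largest norm always passes the test, so at least one expert runs for every token.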