Dynamic Mixture-of-Experts for Visual Autoregressive Model
Jort Vincenti, Metod Jazbec, Guoxuan Xia
TL;DR
Visual Autoregressive Models (VAR) achieve high-quality image generation but incur substantial compute because the Transformer is called repeatedly across scales. The paper introduces a dynamic Mixture-of-Experts (MoE) router with scale-aware thresholding (D2DMoE) that sparsifies the Transformer feed-forward networks by activating fewer experts per token and per scale, without retraining. On ImageNet, concentrating MoE sparsification on the last three refinement scales reduces FLOPs by about 19% and speeds up inference by roughly 11% while keeping FID within 1% of the dense VAR baseline. This shows that exploiting redundancy across both spatial tokens and hierarchical scales can make high-quality visual autoregressive generation more efficient, with potential for further gains from targeted fine-tuning and improved routing.
Abstract
Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture allows compute to be traded for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs, 11% faster inference and match the image quality achieved by the dense baseline.
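To make the idea of scale-aware thresholding concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a relative-threshold rule in which an expert is kept for a token when its non-negative router score is at least `tau` times that token's highest score, and it assumes that a threshold of 0 (all experts active, i.e. dense) is used on every scale except the last few. The function names, the score distribution, the expert count, and the token counts per scale are all hypothetical choices made for illustration.

```python
import numpy as np


def select_experts(router_scores: np.ndarray, tau: float) -> np.ndarray:
    """Return a boolean mask of shape (tokens, experts).

    An expert is kept for a token when its score is at least `tau` times
    the largest score for that token. Scores are assumed non-negative,
    so tau = 0 keeps every expert (dense behaviour).
    """
    max_score = router_scores.max(axis=-1, keepdims=True)
    return router_scores >= tau * max_score


def scale_thresholds(num_scales: int, tau: float, num_sparse_scales: int = 3) -> np.ndarray:
    """Per-scale thresholds: 0 on early scales, `tau` on the last few scales."""
    taus = np.zeros(num_scales)
    if num_sparse_scales > 0:
        taus[-num_sparse_scales:] = tau
    return taus


def active_expert_fraction(scores_per_scale, taus) -> float:
    """Fraction of (token, expert) pairs that are actually executed."""
    active = sum(int(select_experts(s, t).sum()) for s, t in zip(scores_per_scale, taus))
    total = sum(s.size for s in scores_per_scale)
    return active / total


# Illustrative usage with random non-negative scores (hypothetical sizes).
rng = np.random.default_rng(0)
side_lengths = [1, 2, 4, 8, 16]   # hypothetical token-grid side per scale
num_experts = 8                   # hypothetical expert count
scores = [rng.random((side * side, num_experts)) for side in side_lengths]

taus = scale_thresholds(len(side_lengths), tau=0.5)
fraction = active_expert_fraction(scores, taus)
```

Because the later scales hold most of the tokens, sparsifying only those scales already removes a large share of expert evaluations in this toy setting, while the early, coarse scales stay dense. Raising `tau` trades more compute for a potentially larger quality loss.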
