Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam-Woh Ng, Juan C. Pérez, Juan-Manuel Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber
TL;DR
MoS introduces a token-wise, learnable router that enables dynamic, sparse, and state-dependent fusion across asymmetric multimodal transformers in diffusion models. By routing contextual features from a frozen understanding tower to a trainable generation tower at each denoising step, MoS achieves state-of-the-art results on text-to-image generation and image editing with only 3–5B parameters, matching or surpassing counterparts up to $4\times$ larger. The method relies on adaptive, token-specific routing, top-$k$ sparsity trained with $\epsilon$-greedy exploration, and a lightweight router that keeps the added cost small. Extensive ablations validate the necessity of dynamic conditioning, token-level routing, and adaptive layer selection, while scaling experiments demonstrate strong performance even with reduced compute and staged training. The work presents MoS as a flexible, compute-efficient paradigm for scaling multimodal diffusion, with promising directions in dual-way fusion, alignment with human preferences, and further gains in efficiency and interpretability.
Abstract
We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
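The routing step described above can be illustrated with a minimal, dependency-free sketch for a single token. It is not the authors' implementation. The function names `route_token` and `mix_states`, the softmax renormalization over the selected layers, and the choice to explore by sampling $k$ layers uniformly at random are all assumptions made for illustration. The router logits are taken as given here; in MoS they would come from a small learned network conditioned on the input and the denoising timestep.

```python
import math
import random


def route_token(logits, k, epsilon=0.0, rng=random):
    """Pick k source layers for one token and return {layer_index: weight}.

    Hypothetical helper: `logits` holds one router score per candidate
    layer of the understanding tower.
    """
    num_layers = len(logits)
    if not 1 <= k <= num_layers:
        raise ValueError("k must be between 1 and the number of layers")
    if rng.random() < epsilon:
        # Exploration (assumed form): sample k distinct layers uniformly.
        chosen = rng.sample(range(num_layers), k)
    else:
        # Exploitation: keep the k layers with the largest logits.
        order = sorted(range(num_layers), key=lambda i: logits[i], reverse=True)
        chosen = order[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def mix_states(states, weights):
    """Weighted sum of the selected layers' hidden vectors for one token."""
    dim = len(states[0])
    mixed = [0.0] * dim
    for layer, w in weights.items():
        for d in range(dim):
            mixed[d] += w * states[layer][d]
    return mixed


# Hidden states of one token at 4 understanding-tower layers (dimension 2).
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
logits = [0.1, 2.0, 2.0, -1.0]

weights = route_token(logits, k=2, epsilon=0.0)  # greedy top-2: layers 1 and 2
mixed = mix_states(states, weights)              # [1.0, 1.5]
```

With `epsilon=0.0` the selection is purely greedy, which corresponds to inference-time behavior in this sketch. A positive `epsilon` occasionally routes a token through layers the router currently scores low, so those layers still receive a training signal.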
