Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam-Woh Ng, Juan C. Pérez, Juan-Manuel Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber

TL;DR

MoS introduces a token-wise, learnable router that enables dynamic, sparse, and state-dependent fusion across asymmetric multimodal transformers in diffusion models. By routing contextual features from a frozen understanding tower to a trainable generation tower at each denoising step, MoS achieves state-of-the-art results on text-to-image generation and image editing with only 3–5B parameters, matching or surpassing baselines up to $4\times$ larger. The method combines adaptive, token-specific routing, top-$k$ sparsity with $\epsilon$-greedy exploration during training, and a lightweight router that keeps the overhead small. Extensive ablations validate the necessity of dynamic conditioning, token-level routing, and adaptive layer selection, while scaling experiments demonstrate strong performance even with reduced compute and staged training. The work presents MoS as a flexible, compute-efficient paradigm for scaling multimodal diffusion, with promising directions for dual-way fusion, alignment with human preferences, and further efficiency and interpretability improvements.

Abstract

We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-$k$ hidden states and is trained with an $\epsilon$-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to $4\times$ larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
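
To make the routing idea in the abstract concrete, the following is a minimal NumPy sketch of token-wise top-$k$ selection with $\epsilon$-greedy exploration. It is an illustration under stated assumptions, not the authors' implementation: the function name `route_top_k`, the tensor shapes, the per-token exploration rule, and the softmax-weighted sum over the selected states are all hypothetical choices made here for clarity.

```python
import numpy as np

def route_top_k(logits, hidden_states, k, epsilon=0.0, rng=None):
    """Fuse per-token hidden states chosen by a sparse router (illustrative only).

    logits:        (T, L) router scores, one per token and source layer (assumed shape).
    hidden_states: (L, T, D) hidden states from L source layers (assumed shape).
    k:             number of source layers kept per token.
    epsilon:       probability that a token explores k random layers instead
                   of its top-k (assumed form of epsilon-greedy exploration).
    Returns the fused states (T, D) and the selected layer indices (T, k).
    """
    rng = np.random.default_rng() if rng is None else rng
    T, L = logits.shape

    # Greedy choice: the k highest-scoring layers for each token.
    top_idx = np.argsort(-logits, axis=1)[:, :k]

    # Exploratory choice: k distinct layers sampled uniformly per token.
    rand_idx = np.argsort(rng.random((T, L)), axis=1)[:, :k]

    # Each token explores independently with probability epsilon.
    explore = rng.random(T) < epsilon
    idx = np.where(explore[:, None], rand_idx, top_idx)

    # Softmax over the selected logits only (assumption: weights are
    # renormalised after sparsification).
    sel = np.take_along_axis(logits, idx, axis=1)
    sel = sel - sel.max(axis=1, keepdims=True)
    weights = np.exp(sel)
    weights /= weights.sum(axis=1, keepdims=True)

    # Gather hidden_states[idx[t, j], t, :] for every token t and slot j.
    gathered = hidden_states[idx, np.arange(T)[:, None], :]  # (T, k, D)
    fused = (weights[:, :, None] * gathered).sum(axis=1)     # (T, D)
    return fused, idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(6, 8))       # 6 tokens, 8 source layers
    states = rng.normal(size=(8, 6, 16))   # 16-dimensional hidden states
    fused, idx = route_top_k(logits, states, k=2, epsilon=0.1, rng=rng)
    print(fused.shape, idx.shape)
```

With `epsilon=0.0` the function reduces to plain top-$k$ routing, which is the natural setting at inference time in this sketch. In the paper's setting the router is described as depending on the prompt, the visual latents, and the denoising step; here `logits` simply stands in for whatever scores such a router would produce.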

Paper Structure

This paper contains 44 sections, 4 equations, 18 figures, 11 tables, 2 algorithms.

Figures (18)

  • Figure 1: Generation examples by MoS-Image (left) and MoS-Edit (right). MoS introduces a learnable, token-wise router that efficiently aggregates feature states across modalities. This allows for high-quality visual synthesis, producing photorealistic and stylized outputs from text and image inputs with precise control and quality.
  • Figure 2: MoS enables sparse and dynamic interactions across modalities and transformers. We illustrate MoS with text-to-image generation. Previous approaches, such as (a) cross-attention and (b) self-attention, typically provide only the final text encoder block's embedding as input to the visual branch, limiting the richness of cross-modal information. (c) MoT (Mixture-of-Transformers) attempts finer-grained interaction by passing outputs from all text blocks in a rigid, layer-by-layer fashion. In contrast, our proposed (d) MoS (Mixture of States) employs a learnable sparse interaction that dynamically links any text block to any visual block. The routing adapts to the current input, comprising the text prompt, visual latents, and denoising step embeddings, enabling flexible and efficient multimodal fusion.
  • Figure 3: MoS Design Details. MoS introduces a new paradigm for multimodal interaction within transformer architectures. Rather than depending on manually designed fusion strategies, MoS employs a learned router to establish token-level sparse and dynamic connections between transformer blocks. For illustration, we use image generation as the running example and thus refer to the understanding-tower features as textual embeddings.
  • Figure 3: FID and CLIP results on MJHQ comparing hand-crafted routing and MoS. MoS significantly outperforms the hand-crafted routing baseline.
  • Figure 4: Image Editing Inference Pipeline. Both the understanding and generation towers take the reference image as input, with their interaction facilitated through the MoS module.
  • ...and 13 more figures