Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang

Abstract

Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.
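
The textual-proxy data construction mentioned above can be illustrated with a minimal sketch: each image in a source interleaved document is replaced by a dense textual description, yielding a pure-text sequence on which the planner can be trained. The captioner interface and the <img> delimiter tokens below are illustrative assumptions, not details from the paper.

    # Hypothetical sketch of building textual-proxy interleaved data:
    # visual content is represented in text so the planner trains on
    # text alone. Captioner API and tag format are assumptions.

    IMG_OPEN, IMG_CLOSE = "<img>", "</img>"  # assumed delimiter tokens

    def to_textual_proxy(document, captioner):
        """Turn an interleaved (text | image) document into a pure-text sequence."""
        parts = []
        for segment in document:  # each segment is a str or an image object
            if isinstance(segment, str):
                parts.append(segment)
            else:
                # Represent the visual content by a dense caption, wrapped in
                # delimiters so the planner learns where visualization cues belong.
                parts.append(f"{IMG_OPEN}{captioner.describe(segment)}{IMG_CLOSE}")
        return "\n".join(parts)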

Paper Structure

This paper contains 33 sections, 1 equation, 16 figures, and 9 tables.

Figures (16)

  • Figure 1: Overview of the inference process of Wan-Weaver. Given a prompt, the planner expert autoregressively generates plain text and dense prompts as visualization cues. Through causal multi-modal self-attention, the visualizer interacts with the planner, enabling it to synthesize images conditioned on the dense prompt context and visual references. The resulting text–image outputs are appended to the history and fed back into the planner, yielding an iterative interleaved generation process that maintains long-range contextual coherence. (A code sketch of this loop follows the figure list.)
  • Figure 2: Illustration of our decoupled training strategy.
  • Figure 3: Statistics of WeaverBench. (a) Topic distribution across 14 everyday categories. (b) Prompt length distribution. (c) Distribution of the number of images requested per prompt.
  • Figure 4: Qualitative comparison with the state-of-the-art commercial system Nano Banana on interleaved text–image generation.
  • Figure 5: Loss curves of different training strategies.
  • ...and 11 more figures
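
Read together with the abstract, the Figure 1 caption suggests the inference loop sketched below. The planner.generate and visualizer.synthesize interfaces are hypothetical names introduced for illustration; in the actual model the two experts share one causal multi-modal attention context rather than communicating through separate function calls.

    def interleaved_generate(planner, visualizer, prompt, max_turns=8):
        # Minimal sketch of Wan-Weaver's iterative interleaved generation,
        # under assumed interfaces (names are illustrative, not from the paper).
        history = [prompt]  # shared multi-modal context
        outputs = []
        for _ in range(max_turns):
            # The planner autoregressively emits plain text plus, when an image
            # is needed, a dense prompt that serves as the visualization cue.
            text, dense_prompt, done = planner.generate(history)
            image = None
            if dense_prompt is not None:
                # The visualizer conditions on the dense prompt context and on
                # previously generated images as visual references.
                references = [img for _, _, img in outputs if img is not None]
                image = visualizer.synthesize(dense_prompt, references)
            outputs.append((text, dense_prompt, image))
            # Append the new text-image pair to the history and feed it back,
            # so later turns stay coherent with the long-range context.
            history += [x for x in (text, dense_prompt, image) if x is not None]
            if done:
                break
        return outputs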