Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang

TL;DR

Emu3.5 is a native multimodal world model trained with a unified next-token objective on over 10 trillion vision-language tokens, enabling long-horizon, interleaved generation across text and visuals. It combines two-stage large-scale pre-training, supervised fine-tuning, and reinforcement learning with Discrete Diffusion Adaptation (DiDA), a diffusion-inspired inference acceleration method, to achieve fast, high-quality generation. The model demonstrates strong capabilities in text-to-image and any-to-image generation, visual narratives, visual guidance, world exploration, and embodied manipulation, surpassing several state-of-the-art baselines on multiple benchmarks and tasks. Open-sourced extensions include a robust tokenizer, diffusion-based decoders, and a unified training-and-inference framework that supports scalable cross-modal reasoning and open-world interaction. This work pushes toward a general-purpose multimodal world model with broad implications for multimodal AI agents and embodied systems.

Abstract

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
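
The contrast between token-by-token decoding and parallel prediction described in the abstract can be illustrated with a toy sketch. This is not the authors' DiDA implementation: `toy_predict`, the `MASK` placeholder, the confidence-based unmasking schedule, and the step count are all assumptions made for illustration, and the stand-in "model" ignores its context entirely. The sketch only shows why filling many image-token positions per forward pass needs far fewer passes than emitting one token at a time.

```python
import random

MASK = -1  # hypothetical placeholder id for an image token not yet decoded


def toy_predict(tokens, position, vocab_size):
    """Stand-in for one position's model output: (token id, confidence).

    A real model would condition on `tokens`; this toy ignores the context
    and derives a deterministic pseudo-random answer from the position.
    """
    rng = random.Random(position * 7919 + 13)
    return rng.randrange(vocab_size), rng.random()


def decode_autoregressive(num_tokens, vocab_size):
    """Emit one token per forward pass, left to right."""
    tokens, passes = [], 0
    for pos in range(num_tokens):
        token, _ = toy_predict(tokens, pos, vocab_size)
        tokens.append(token)
        passes += 1
    return tokens, passes


def decode_parallel(num_tokens, vocab_size, num_steps):
    """Start fully masked; each pass proposes all masked positions at once
    and commits the most confident share of them (an assumed schedule)."""
    tokens = [MASK] * num_tokens
    passes = 0
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One forward pass yields a proposal for every masked position.
        proposals = {i: toy_predict(tokens, i, vocab_size) for i in masked}
        passes += 1
        # Commit ceil(remaining / steps_left) positions so the last step
        # always finishes the image.
        steps_left = num_steps - step
        keep = -(-len(masked) // steps_left)
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:keep]:
            tokens[i] = proposals[i][0]
    return tokens, passes


if __name__ == "__main__":
    _, ar_passes = decode_autoregressive(num_tokens=1024, vocab_size=4096)
    _, par_passes = decode_parallel(num_tokens=1024, vocab_size=4096, num_steps=16)
    print(f"autoregressive passes: {ar_passes}, parallel passes: {par_passes}")
```

The pass counts here are a property of the toy schedule only. The roughly 20x per-image speedup reported in the abstract is a measured inference figure and should not be read off this sketch.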

Paper Structure

This paper contains 56 sections, 17 figures, and 16 tables.

Figures (17)

  • Figure 3: Overview of the Emu3.5 architecture. The model is trained end-to-end at scale with a unified next-token prediction objective. During inference, single-token prediction is accelerated via discrete diffusion adaptation, enabling bidirectional parallel generation per image.
  • Figure 4: Overall training pipeline of Emu3.5.
  • Figure 5: Data statistics of video interleaved data.
  • Figure 6: Video interleaved data samples from Emu3.5's pre-training dataset.
  • Figure 7: Training and validation loss trends of Emu3.5 during the first stage of pre-training. The curves indicate that Emu3.5 achieves smooth and stable optimization, while maintaining consistent generalization across multiple validation datasets.
  • ...and 12 more figures