MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation

Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun

TL;DR

Mammoth2 is a unified autoregressive-diffusion (AR-Diffusion) framework that couples semantic planning over discrete visual tokens with high-fidelity image synthesis, enabling text-to-image generation and instruction-based editing without sacrificing multimodal understanding. The architecture places generation experts inside an autoregressive backbone and pairs it with a single-stream Diffusion Transformer. The two are connected through an AR-Diffusion feature alignment module with three components (multi-layer feature aggregation, unified condition encoding, and in-context conditioning) and share a unified visual tokenizer (MammothTok). A progressive, multi-stage training curriculum combines Next-Token Prediction, Flow Matching, supervised fine-tuning, and DiffusionNFT-based reinforcement learning across generation and editing tasks. Using roughly 60M supervised generation samples and no pre-trained generators, the model reaches 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit while keeping understanding performance competitive. These results suggest that tightly coupled AR and diffusion pathways can deliver controllable, high-quality generation and editing in a single data- and parameter-efficient model. Key contributions include the AR-Diffusion feature alignment module, the MammothTok visual tokenizer, a hybrid-reward diffusion fine-tuning regime, and a scalable training strategy validated on generation, editing, and understanding benchmarks.

Abstract

Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.
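
The joint training objective mentioned in the abstract can be illustrated with a small sketch. It assumes the common linear-interpolation form of Flow Matching (noise sample x0, data latent x1, target velocity x1 - x0) and a plain weighted sum with the Next-Token Prediction cross-entropy. The weight `fm_weight`, all function names, and the toy velocity predictor are hypothetical and are not taken from the paper; the exact formulation used by Mammoth2 may differ.

```python
import math
import random

def ntp_loss(logits, targets):
    """Mean cross-entropy of next-token logits against target token ids."""
    total = 0.0
    for row, tgt in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[tgt]
    return total / len(targets)

def flow_matching_loss(predict_velocity, x1, rng):
    """MSE between predicted and target velocity on a linear noise-to-data path."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]              # Gaussian noise sample
    t = rng.random()                                    # timestep in [0, 1)
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    target = [b - a for a, b in zip(x0, x1)]            # assumed target: x1 - x0
    pred = predict_velocity(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)

def joint_loss(logits, targets, predict_velocity, latent, rng, fm_weight=1.0):
    """Hypothetical combination: NTP loss plus a weighted Flow Matching loss."""
    fm = flow_matching_loss(predict_velocity, latent, rng)
    return ntp_loss(logits, targets) + fm_weight * fm

# Toy usage: a zero-velocity predictor stands in for the diffusion decoder.
rng = random.Random(0)
zero_velocity = lambda xt, t: [0.0] * len(xt)
loss = joint_loss([[0.0, 0.0, 0.0, 0.0]], [2], zero_velocity, [0.5, -0.5], rng)
```

In this sketch the two terms touch different parts of the model: the cross-entropy term depends only on the AR token logits, while the velocity term depends only on the decoder's prediction for a noised latent.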

Paper Structure

This paper contains 41 sections, 5 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Mammoth2 seamlessly handles text-to-image generation, instruction-based editing, and visual understanding within a single model, delivering high-fidelity results across diverse real-world scenarios.
  • Figure 2: Mammoth2 Architecture. A serial AR-Diffusion framework in which an autoregressive backbone performs semantic planning over MammothTok visual tokens, and a diffusion decoder generates high-fidelity images conditioned on the AR features.
  • Figure 3: AR-Diffusion feature alignment module. Multi-layer feature aggregation, unified condition encoding, and in-context conditioning enable seamless transition from AR feature outputs to diffusion feature inputs.
  • Figure 4: Multi-stage training strategy. Stage 1: generation pretraining; Stage 2: unified joint training with all parameters unfrozen (SFT); Stage 3: RL post-training (DiffusionNFT on the diffusion branch) with multi-signal rewards.
  • Figure 5: Noise Regularization on MammothTok Tokens. We visualize Region Noise (spatial patch corruption) and Similarity Noise (codebook-based token replacement) applied to MammothTok discrete tokens during training. Compared with Region Noise, Similarity Noise better preserves global structure while injecting realistic local variations, leading to more robust autoregressive trajectories under teacher forcing at inference time.
  • ...and 5 more figures
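
The two noise types named in the Figure 5 caption can be sketched on a toy grid of discrete token ids. This is a minimal illustration under stated assumptions, not the paper's implementation: the patch shape, the replacement probability `prob`, the neighbor count `k`, the squared Euclidean distance, and all function names are hypothetical.

```python
import random

def region_noise(grid, patch, vocab_size, rng):
    """Replace one random patch x patch square of token ids with random ids."""
    h, w = len(grid), len(grid[0])
    top = rng.randrange(h - patch + 1)
    left = rng.randrange(w - patch + 1)
    noisy = [row[:] for row in grid]  # copy so the input grid is not mutated
    for r in range(top, top + patch):
        for c in range(left, left + patch):
            noisy[r][c] = rng.randrange(vocab_size)
    return noisy

def nearest_neighbors(codebook, k):
    """For each code, list the k closest other codes (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    table = []
    for i, code in enumerate(codebook):
        others = sorted((j for j in range(len(codebook)) if j != i),
                        key=lambda j: dist(code, codebook[j]))
        table.append(others[:k])
    return table

def similarity_noise(grid, neighbors, prob, rng):
    """With probability prob, swap each token for one of its nearest codebook neighbors."""
    return [[rng.choice(neighbors[t]) if rng.random() < prob else t for t in row]
            for row in grid]

# Toy usage: a 4-entry, 1-D codebook and a 2x2 token grid.
codebook = [[0.0], [0.1], [5.0], [5.1]]
neighbors = nearest_neighbors(codebook, k=1)
tokens = [[0, 2], [3, 1]]
perturbed = similarity_noise(tokens, neighbors, prob=0.5, rng=random.Random(0))
```

The sketch reflects the contrast drawn in the caption: region noise overwrites a whole spatial patch with unrelated ids, whereas similarity noise swaps individual tokens for nearby codebook entries, so the perturbed grid stays close to the original in embedding space.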