MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun
TL;DR
MammothModa2 (Mammoth2) is a unified autoregressive-diffusion framework that couples semantic planning over discrete visual tokens with high-fidelity image synthesis, enabling text-to-image generation and instruction-based editing without sacrificing multimodal understanding. The architecture pairs an autoregressive backbone equipped with generation experts with a single-stream Diffusion Transformer, connected through a three-stage AR-Diffusion feature alignment module and a unified visual tokenizer (MammothTok). A progressive, multi-stage training curriculum combines Next-Token Prediction, Flow Matching, supervised fine-tuning, and DiffusionNFT-based reinforcement learning across generation and editing tasks. Trained on around 60M generation samples and without pre-trained generators, the model achieves strong results on GenEval (≈0.87), DPGBench (≈87.2), and ImgEdit (≈4.06) while keeping its multimodal understanding performance competitive. These results indicate that the approach is data- and parameter-efficient and robust at editing, and they suggest that tightly coupled AR and diffusion pathways can deliver unified, controllable, high-quality multimodal intelligence in a single framework. Key contributions include the AR-Diffusion feature alignment module, the MammothTok visual tokenizer, multi-layer AR feature aggregation, a hybrid-reward diffusion fine-tuning regime, and a scalable training strategy validated on diverse generation, editing, and understanding benchmarks.
Abstract
Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.
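To make the joint training objective mentioned above more concrete, the following is a minimal sketch of how a Next-Token Prediction loss and a Flow Matching loss might be summed into one scalar. It is an illustration under stated assumptions, not the authors' implementation: the linear noise-to-latent interpolation path, the velocity target `latent - noise`, the weighting parameter `fm_weight`, and all function names are hypothetical choices made for this example.

```python
import math


def ntp_loss(logits, targets):
    """Mean cross-entropy of next-token targets given per-position logits."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def interpolate(noise, latent, t):
    """Point on the straight path from noise (t=0) to the clean latent (t=1).

    The linear path is an assumption of this sketch.
    """
    return [(1.0 - t) * n + t * x for n, x in zip(noise, latent)]


def flow_matching_loss(velocity_fn, latent, noise, t):
    """Mean squared error between predicted velocity and (latent - noise)."""
    x_t = interpolate(noise, latent, t)
    pred = velocity_fn(x_t, t)
    target = [x - n for x, n in zip(latent, noise)]
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def joint_loss(logits, targets, velocity_fn, latent, noise, t, fm_weight=1.0):
    """Weighted sum of both objectives; fm_weight is a hypothetical knob."""
    fm = flow_matching_loss(velocity_fn, latent, noise, t)
    return ntp_loss(logits, targets) + fm_weight * fm


if __name__ == "__main__":
    logits = [[0.0, 0.0], [0.0, 0.0]]  # uniform over a 2-token vocabulary
    targets = [0, 1]
    latent, noise = [1.0, 2.0], [0.0, 0.0]

    def zero_velocity(x_t, t):
        # Stand-in for a diffusion decoder that always predicts zero velocity.
        return [0.0 for _ in x_t]

    print(joint_loss(logits, targets, zero_velocity, latent, noise, t=0.5))
```

In this toy setup the autoregressive branch and the diffusion branch contribute independent loss terms; in a real model both terms would be backpropagated through shared parameters, which is what makes the training joint.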
