Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

Yangming Shi, Shixiang Zhu, Tao Shen, Zhimiao Yu, Dengsheng Chen, Taicai Chen, Yunfei Yang, Juan Zhou, Chen Cheng, Liang Ma, Xibin Wu, Benxuan Yan, Ge Li, Tuoyu Zhang, Dan Li, Chang Liu, Zhenbang Sun

Abstract

We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters per token, significantly reducing training costs while scaling up model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality on OpenVE-Bench, surpassing the evaluated open-source models and matching current top-tier proprietary models, including Kling O1. Furthermore, we introduce a joint few-step distillation and reinforcement learning framework that compresses the 30-step editing model into a 4-step model, greatly accelerating inference: compared to open-source baselines, Mamoda2.5 achieves up to $95.9\times$ faster video editing inference. In real-world applications, Mamoda2.5 has been successfully deployed for content moderation and creative restoration in advertising, achieving a 98% success rate on internal advertising video editing tasks.
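
To make the MoE design above concrete, the following is a minimal sketch of the DiT-MoE feed-forward sublayer also described in the Figure 4 caption below: 128 fine-grained routed experts, sigmoid-based Top-8 gating, and a loss-free Expert Bias that steers load balancing. This is not the authors' implementation; the hidden sizes, module names, and the exact bias-update rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Illustrative DiT-MoE FFN sublayer: 128 routed experts, Top-8 sigmoid
    gating, and a non-learned Expert Bias used only for expert selection."""

    def __init__(self, d_model=1024, d_expert=256, n_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Loss-free balancing: a buffer rather than a parameter, so no
        # auxiliary loss is needed; it is nudged from observed expert load.
        self.register_buffer("expert_bias", torch.zeros(n_experts))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts))

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = torch.sigmoid(self.router(x))  # per-expert affinities
        # The bias shifts which experts win Top-K, but the gate values
        # themselves come from the unbiased scores.
        idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1).indices
        gates = torch.gather(scores, -1, idx)
        gates = gates / gates.sum(-1, keepdim=True)  # normalize over Top-8
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():    # dispatch token groups
                rows = idx[:, k] == e
                out[rows] += gates[rows, k].unsqueeze(-1) * self.experts[e](x[rows])
        return out

    @torch.no_grad()
    def rebalance(self, expert_load, step=1e-3):
        # Assumed update rule: push down the bias of overloaded experts,
        # push up underloaded ones, with no gradient involved.
        self.expert_bias -= step * torch.sign(expert_load - expert_load.mean())
```

With these toy sizes each token passes through 8 of 128 small experts, the same routing pattern that lets the 25B-parameter model activate only about 3B parameters per token.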

Figures (24)

  • Figure 1: Benchmark performance of Mamoda2.5 and its counterparts.
  • Figure 2: Mamoda2.5 showcase.
  • Figure 3: Overall architecture of Mamoda2.5. The unified AR-Diffusion pipeline organizes instruction understanding and visual generation/editing into a single end-to-end framework. The AR module produces conditional representations via a MetaQueries mechanism, which are then injected into the DiT-MoE backbone together with text/visual conditions for iterative denoising in latent space.
  • Figure 4: Illustration of the DiT-MoE block. Each block replaces the standard FFN sublayer with a Mixture-of-Experts layer comprising fine-grained routed experts. A sigmoid-based Top-K gating mechanism with loss-free Expert Bias controls expert selection and load balancing.
  • Figure 5: Overview of the proposed video editing data synthesis pipeline. Stage 1: LLM-based prompt pair generation. Stage 2: paired video synthesis with shared denoising steps for structural consistency. Stage 3: VLM-based recaptioning, quality filtering, and bidirectional inversion to double the training set (a schematic outline follows this list).
  • ...and 19 more figures
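
As a complement to the Figure 5 caption, here is a hypothetical outline of the three-stage data synthesis pipeline. The component objects (`llm`, `t2v`, `vlm`) and every method name below are placeholder assumptions; only the stage structure comes from the caption.

```python
# Hypothetical pipeline outline; `llm`, `t2v`, and `vlm` stand in for the
# LLM, text-to-video, and VLM components named in the Figure 5 caption.
def build_editing_pairs(seed_captions, llm, t2v, vlm, shared_steps=15):
    dataset = []
    for caption in seed_captions:
        # Stage 1: the LLM expands one caption into a (source, target)
        # prompt pair plus the edit instruction linking them.
        src_prompt, tgt_prompt, instruction = llm.make_prompt_pair(caption)

        # Stage 2: synthesize both videos from the same initial noise,
        # sharing early denoising steps so the pair stays structurally
        # aligned (the number of shared steps here is an assumption).
        noise = t2v.sample_noise()
        src_video, tgt_video = t2v.paired_denoise(
            noise, src_prompt, tgt_prompt, shared_steps=shared_steps)

        # Stage 3: VLM-based recaptioning and quality filtering; keeping
        # the reversed pair ("bidirectional inversion") doubles the data.
        instruction = vlm.recaption(src_video, tgt_video, instruction)
        if vlm.passes_quality(src_video, tgt_video):
            dataset.append((src_video, instruction, tgt_video))
            dataset.append((tgt_video, vlm.invert(instruction), src_video))
    return dataset
```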