Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang

TL;DR

Uni-MoE-2.0-Omni presents an open-source omnimodal large model built from a dense LLM backbone to achieve unified understanding and generation across text, image, audio, and video. The method centers on a dynamic-capacity MoE with three expert roles (Routed, Shared, Null), differentiable routing gradient estimation, and Omni-Modality 3D RoPE for cross-modal alignment, complemented by a Task-Aware Diffusion Transformer for image generation and a context-aware MoE-TTS for speech. A progressive training pipeline—including cross-modal pretraining, modality-specific warmups, mixed-data fine-tuning, annealing, and GSPO-DPO reinforcement learning—stabilizes optimization and enhances reasoning. Extensive evaluation on 85 benchmarks shows SOTA or competitive performance in video understanding, omnimodal comprehension, long-form speech, and controllable image generation, with the model open-sourced to accelerate reproducibility and further research.

Abstract

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on a dense LLM, we build Uni-MoE-2.0-Omni from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilize RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
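
As a rough illustration of how shared, routed, and null experts could interact for a single token, the sketch below may help. It is not the authors' implementation. The cumulative-probability (top-p) selection rule, the function names (`select_experts`, `moe_layer`), the `top_p` value, and the toy experts are all assumptions made for this example, and the differentiable routing gradient estimation mentioned in the TL;DR is not modeled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_experts(probs, top_p):
    """Assumed rule: take the smallest set of experts whose cumulative
    router probability reaches top_p, so the count varies per token."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return chosen

def moe_layer(token, router_logits, routed_experts, shared_expert, top_p=0.7):
    """router_logits has len(routed_experts) + 1 entries; by convention in
    this sketch the last entry belongs to the null expert."""
    null_index = len(routed_experts)
    probs = softmax(router_logits)
    chosen = select_experts(probs, top_p)
    out = shared_expert(token)  # the shared expert is always applied
    for i in chosen:
        if i == null_index:
            continue  # null expert: no computation, no contribution
        expert_out = routed_experts[i](token)
        out = [o + probs[i] * e for o, e in zip(out, expert_out)]
    return out, chosen

# Hypothetical toy experts acting on a 2-dimensional token.
routed = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
shared = lambda x: [0.5 * v for v in x]
output, picked = moe_layer([1.0, 2.0], [0.0, 0.0, 10.0], routed, shared)
```

In this toy setup, a token whose router mass falls almost entirely on the null expert receives only the shared expert's output, which is one way to read "token skipping". A token with a flat router distribution activates more experts under the same threshold.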

Paper Structure

This paper contains 70 sections, 17 equations, 10 figures, 21 tables, 1 algorithm.

Figures (10)

  • Figure 1: The performance of Uni-MoE-2.0-Omni and previous SOTA omnimodal large models.
  • Figure 2: The Uni-MoE-2.0-Omni architecture processes multimodal data through a unified tokenization strategy. Audio is tokenized in 30-second clips, augmented with generation tokens for voice control in the Context-Aware MoE-TTS module, while images are encoded using a sliding window technique. Image Generation Tokens bridge the model to a Task-Aware Diffusion Transformer for end-to-end generation tasks. The model's comprehension is powered by Omni-Modality 3D RoPE, which aligns inputs across time, and a dynamic-capacity MoE layer. This MoE layer dynamically routes information using diverse experts, with stability ensured by null experts (for token skipping) and modality-specific routed experts (A, V, and T indicate audio, visual, and textual experts pretrained on the corresponding data). In contrast, compact shared experts (only 1/8 the size of routed experts) enable efficient cross-modal knowledge transfer.
  • Figure 3: Illustration of Context-Aware MoE-TTS. Differently colored blocks represent distinct token types, illustrating our long-context streaming decoding method. The Uni-MoE-TTS module will be released separately, featuring three unique and controllable voice styles.
  • Figure 4: The overview of the Task-aware Diffusion Transformer (Task-DiT). The role of the projection modules is to map external, task-conditioning features into the latent space of the Diffusion Transformer, where they are utilized as context in cross-attention blocks to guide the image generation.
  • Figure 5: The training recipe for adapting an LLM into an omnimodal large model.
  • ...and 5 more figures