MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team: Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang, Chushu Zhou, Hanfu Chen, Hongnan Ma, Jiaxi Li, Jingqi Tong, Junxi Liu, Ke Chen, Shimin Li, Shiqi Jiang, Songlin Wang, Wei Jiang, Zhaoye Fei, Zhiyuan Ning, Chunguo Li, Chenhui Li, Ziwei He, Zengfeng Huang, Xie Chen, Xipeng Qiu

TL;DR

MOVA presents an open baseline for scalable, synchronized video–audio generation through an asymmetric dual-tower diffusion architecture connected by a bidirectional Bridge and aligned RoPE. The model jointly trains a video backbone and a 1.3B audio backbone with a three-phase, data-centric training pipeline, enabling high-fidelity lip synchronization and environment-aware sounds across 360p and 720p outputs. Key innovations include decoupled per-modality noise schedules via dual sigma shift, heterogeneous learning rates, dual classifier-free guidance, and a comprehensive data curation and captioning pipeline that merges visual and auditory annotations. Empirical results on objective metrics and arena-based human evaluation demonstrate competitive audiovisual fidelity, synchronization, and speaker attribution relative to strong baselines, with MOVA showing emergent T2VA capabilities and robust performance under prompt-enhanced inference. The work emphasizes open science by releasing weights and code, and highlights future directions in reducing training cost and improving multi-speaker annotation reliability while maintaining audiovisual coherence.
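
The idea of decoupled noise schedules can be illustrated with a minimal sketch. It assumes the sigma-shift formula commonly used in flow-matching samplers, with sigma = 1 meaning pure noise; whether MOVA uses exactly this formula is an assumption, and the function names and shift values below are hypothetical, chosen only for illustration.

```python
import random


def shift_sigma(sigma: float, shift: float) -> float:
    """Warp a noise level in [0, 1]; shift > 1 moves it toward sigma = 1 (more noise).

    Assumed formula (common in flow-matching samplers), not confirmed by the source.
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def sample_dual_sigmas(rng: random.Random, video_shift: float, audio_shift: float):
    """Draw independent noise levels for video and audio, each with its own shift."""
    u_video = rng.random()  # uniform draw for the video tower
    u_audio = rng.random()  # separate uniform draw for the audio tower
    return shift_sigma(u_video, video_shift), shift_sigma(u_audio, audio_shift)


rng = random.Random(0)
# Hypothetical shift values; the source does not state them in this section.
sigma_v, sigma_a = sample_dual_sigmas(rng, video_shift=5.0, audio_shift=3.0)
print(f"video sigma={sigma_v:.3f}, audio sigma={sigma_a:.3f}")
```

Because the two draws are independent, a training step may pair a heavily noised video latent with a lightly noised audio latent, or the reverse.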

Abstract

Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
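
The 32B total and 18B active figures can be roughly reconciled with the component sizes given in the Figure 2 caption (A14B video backbone, 1.3B audio backbone, 2.6B Bridge). The sketch below is one reading that happens to match the rounded numbers. It assumes that "A14B" means 14B active parameters and that the video tower holds two equally sized experts with one active per step; the source does not state this breakdown.

```python
# Component sizes in billions of parameters, taken from the Figure 2 caption.
VIDEO_ACTIVE_B = 14.0  # "A14B" read as 14B active parameters (assumption)
AUDIO_B = 1.3
BRIDGE_B = 2.6

# Assumption, not stated in the source: two video experts of equal size.
ASSUMED_VIDEO_EXPERTS = 2

active_b = VIDEO_ACTIVE_B + AUDIO_B + BRIDGE_B
total_b = ASSUMED_VIDEO_EXPERTS * VIDEO_ACTIVE_B + AUDIO_B + BRIDGE_B

print(f"active ~ {active_b:.1f}B, total ~ {total_b:.1f}B")
```

Under these assumptions the sums land within rounding distance of the 18B active and 32B total quoted in the abstract.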

Paper Structure

This paper contains 65 sections, 9 equations, 10 figures, 9 tables, 2 algorithms.

Figures (10)

  • Figure 1: Overview of MOVA capabilities. MOVA generates synchronized video and audio across diverse scenarios: multi-speaker speech with precise lip synchronization in both English and Chinese, physical sound effects aligned with visual events, and scene text generation. The model supports both 16:9 and 9:16 aspect ratios.
  • Figure 2: Model Structure Overview. MOVA couples an A14B video DiT backbone and a 1.3B audio DiT backbone via a 2.6B bidirectional Bridge module.
  • Figure 3: Data curation overview. Our data pipeline consists of three stages. In the first stage, we preprocess the raw data into fixed-length clips with a resolution of 720p, a frame rate of 24fps, and a duration of 8.05s. In the second stage, we filter these clips based on audio quality, video quality, and audio-visual alignment to obtain high-quality, synchronized clips. In the third stage, we utilize Qwen3-Omni and MiMo-VL to label the audio and visual information within the videos, respectively, and finally use GPT-OSS to merge these single-modality captions. Through our data pipeline, we have successfully curated high-quality audio-visual content along with corresponding, semantically rich captions.
  • Figure 4: Training pipeline overview. (a) Audio tower pretraining: We train a 1.3B text-to-audio model with Wan2.1-style architecture on music, general sounds, and TTS data. The audio VAE remains frozen during this stage. (b) Synchronous joint training: The video tower (A14B, blue) and audio tower (1.3B, orange) are connected via bidirectional Bridge cross-attention modules. (c) Video and audio timesteps are sampled independently, allowing each modality to follow its own noise schedule. (d) Bridge modules use a higher learning rate ($\eta_{\text{br}}=2\times10^{-5}$) than backbone DiT blocks ($\eta_{\text{b}}=1\times10^{-5}$) to accelerate cross-modal alignment while preserving pretrained priors. Both VAEs remain frozen throughout training.
  • Figure 5: The overall workflow of MOVA for text-image and text-only to video-audio generation.
  • ...and 5 more figures
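
The heterogeneous learning rates described in the Figure 4 caption, part (d), can be sketched as optimizer parameter groups. The two learning rates come from the caption; the `bridge.` name prefix, the helper name, and the toy parameter names are hypothetical, and this is a sketch of the general technique rather than the released training code.

```python
BRIDGE_LR = 2e-5    # eta_br from the Figure 4 caption
BACKBONE_LR = 1e-5  # eta_b from the Figure 4 caption


def build_param_groups(named_params, bridge_prefix="bridge."):
    """Split (name, param) pairs into backbone and Bridge groups with separate LRs.

    The prefix used to detect Bridge parameters is an assumption for illustration.
    """
    bridge, backbone = [], []
    for name, param in named_params:
        if name.startswith(bridge_prefix):
            bridge.append(param)
        else:
            backbone.append(param)
    # A list of dicts with "params" and "lr" keys is the per-group format
    # that PyTorch optimizers accept.
    return [
        {"params": backbone, "lr": BACKBONE_LR},
        {"params": bridge, "lr": BRIDGE_LR},
    ]


# Toy stand-ins for real tensors so the sketch runs without any dependencies.
toy_params = [
    ("video_tower.blocks.0.weight", "p0"),
    ("audio_tower.blocks.0.weight", "p1"),
    ("bridge.blocks.0.weight", "p2"),
]
groups = build_param_groups(toy_params)
for group in groups:
    print(group["lr"], group["params"])
```

With a real PyTorch model, the same helper could be fed `model.named_parameters()` and its result passed to an optimizer such as `torch.optim.AdamW`; frozen VAE weights would simply be left out of both groups.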