ALIVE: Animate Your World with Lifelike Audio-Video Generation

Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu, Pan Xie, Dongjun Qian, Yu Zhang, Ruiqi Li, Yuqi Zhang, Ruibiao Lu, Xiaofeng Mei, Bo Han, Xiang Yin, Bingyue Peng, Zehuan Yuan

TL;DR

ALIVE advances unified audio-video generation by extending a pretrained Text-to-Video model with a joint Audio-Video DiT and a cascaded Refiner, enabling T2VA, I2VA, and R2VA capabilities. It introduces UniTemp-RoPE for continuous temporal alignment, TA-CrossAttn for cross-modal fusion, and a comprehensive data pipeline with subject-ID correction, labeling, and hierarchical filtering, together with Alive-Bench 1.0 for multi-dimensional evaluation. The approach emphasizes multi-stage training (T2A, T2VA, I2VA, Refiner), asymmetric optimization strategies to balance audio and visual learning, and aesthetic finetuning to raise visual quality while preserving audio fidelity. Role-playing animation is enabled via multi-reference conditioning and dual-criterion inference, achieving strong identity preservation and synchronized audio-visual output. Collectively, ALIVE demonstrates state-of-the-art or competitive performance across motion, aesthetics, and audio-visual synchronization, offering practical recipes and benchmarks for the community to develop robust audio-video generation systems.

Abstract

Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. Compared with T2V foundation models, it unlocks Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch, which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark for comprehensive model testing and comparison. After continued pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.
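The abstract names UniTemp-RoPE only as a mechanism for precise audio-visual alignment and does not give its formulation here. As a rough illustration of the general idea such a scheme might build on, the sketch below places audio and video tokens on one shared time axis (in seconds) and applies standard rotary embeddings at those possibly fractional positions. Everything in it is an assumption made for illustration: the helper names, the token rates (4 video latents and 25 audio latents per second), and the RoPE base are hypothetical and are not taken from the paper.

```python
import math


def temporal_positions(num_tokens, tokens_per_second):
    """Map token indices to timestamps in seconds (hypothetical helper)."""
    return [i / tokens_per_second for i in range(num_tokens)]


def rope_angles(t, dim, base=10000.0):
    """Standard RoPE angles for a (possibly fractional) position t."""
    return [t / (base ** (2 * k / dim)) for k in range(dim // 2)]


def rotate(vec, t, base=10000.0):
    """Apply a rotary embedding to an even-length vector at position t."""
    out = []
    for k, theta in enumerate(rope_angles(t, len(vec), base)):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


# Assumed rates, chosen only for illustration (not from the paper).
video_t = temporal_positions(8, 4.0)    # 8 video tokens spanning 2 seconds
audio_t = temporal_positions(50, 25.0)  # 50 audio tokens spanning 2 seconds

q = [0.3, -1.2, 0.7, 0.5]   # toy query vector (e.g. a video token)
k = [1.0, 0.4, -0.6, 0.9]   # toy key vector (e.g. an audio token)

# Tokens from either stream that occur at the same time get the same rotation,
# and the logit depends only on the time offset between the two tokens.
logit_a = dot(rotate(q, video_t[4]), rotate(k, audio_t[30]))
logit_b = dot(rotate(q, video_t[0]), rotate(k, audio_t[5]))
```

In this toy setup the video token at index 4 and the audio token at index 25 both sit at 1.0 s, so they receive identical rotations even though their raw indices differ. Because RoPE logits depend only on relative position, `logit_a` and `logit_b` agree up to floating-point error: both pairs are separated by 0.2 s.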

Paper Structure (42 sections, 7 equations, 17 figures, 2 tables)

This paper contains 42 sections, 7 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Left: Human evaluation win rates (GSB) of ALIVE compared to Veo 3.1, Kling 2.6, Wan 2.6, Sora 2 and LTX-2 on Alive-Bench 1.0 across six dimensions: Motion Quality, Visual Aesthetic, Visual Prompt Following, Audio Quality, Audio Prompt Following and Audio Video Synchronization. Alive-Bench 1.0 covers a wide range of scenarios, including single-person speech, multi-person conversations, sports, daily activities, animals, means of transportation, surreal scenes, etc.
  • Figure 2: T2V samples generated by ALIVE. ALIVE is capable of generating 1080p videos at arbitrary aspect ratios, delivering high levels of aesthetic quality, realism, and motion fidelity, while also supporting T2VA, I2VA, and R2VA tasks.
  • Figure 3: Architecture of ALIVE.
  • Figure 4: Architecture of Audio DiT.
  • Figure 5: An overview of our proposed data processing pipeline. The process consists of six main stages: video quality pre-processing, captioning, audio quality filtering, SubjectID correction, clarity filtering, and data balancing.
  • ...and 12 more figures