ALIVE: Animate Your World with Lifelike Audio-Video Generation
Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu, Pan Xie, Dongjun Qian, Yu Zhang, Ruiqi Li, Yuqi Zhang, Ruibiao Lu, Xiaofeng Mei, Bo Han, Xiang Yin, Bingyue Peng, Zehuan Yuan
TL;DR
ALIVE advances unified audio-video generation by extending a pretrained Text-to-Video model with a joint Audio-Video DiT and a cascaded Refiner, enabling T2VA, I2VA, and R2VA capabilities. It introduces UniTemp-RoPE for continuous temporal alignment and TA-CrossAttn for cross-modal fusion, along with a comprehensive data pipeline (subject-ID correction, labeling, and hierarchical filtering) and Alive-Bench 1.0 for multi-dimensional evaluation. The approach relies on multi-stage training (T2A, T2VA, I2VA, Refiner), asymmetric optimization strategies that balance audio and visual learning, and aesthetic finetuning that raises visual quality while preserving audio fidelity. Role-playing animation is enabled through multi-reference conditioning and dual-criterion inference, which yields strong identity preservation and synchronized audio-visual output. Overall, ALIVE reports state-of-the-art or competitive performance across motion, aesthetics, and audio-visual synchronization, and offers practical recipes and benchmarks for the community to develop robust audio-video generation systems.
Abstract
Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities that T2V foundation models lack. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch that includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. In addition, we carefully design a comprehensive data pipeline, covering audio-video captioning, quality control, and related steps, to collect high-quality finetuning data. We also introduce a new benchmark for comprehensive model testing and comparison. After continued pretraining and finetuning on million-scale high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.
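To make the idea of temporally aligned positional encoding concrete, the following is a minimal sketch of one plausible reading, not the ALIVE implementation. It assumes that audio and video latent tokens, which arrive at different rates, are each assigned a timestamp in seconds and that rotary angles are computed from that shared continuous time instead of from integer token indices. The function names, the token rates (4 video and 20 audio latent tokens per second), the use of token centres, and the head dimension are all hypothetical choices made for illustration.

```python
import math

def token_times(num_tokens, tokens_per_second):
    # Assumption: each latent token is stamped with the centre of its time span (seconds).
    return [(i + 0.5) / tokens_per_second for i in range(num_tokens)]

def rotary_angles(t, dim=8, base=10000.0):
    # Standard RoPE frequencies, driven by continuous time t instead of a token index.
    return [t * base ** (-2.0 * k / dim) for k in range(dim // 2)]

def apply_rope(vec, angles):
    # Rotate each consecutive pair (x0, x1), (x2, x3), ... by its own angle.
    out = []
    for k, a in enumerate(angles):
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(a) - y * math.sin(a))
        out.append(x * math.sin(a) + y * math.cos(a))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical rates: 4 video latent tokens/s and 20 audio latent tokens/s over 2 seconds.
video_t = token_times(8, 4.0)
audio_t = token_times(40, 20.0)

# Video token 0 and audio token 2 cover the same instant, so they get the same angles.
same_instant = math.isclose(video_t[0], audio_t[2])

# Attention scores between rotated vectors depend only on the time difference.
q = [1.0, 0.5, -0.3, 0.8, 0.2, -0.7, 0.4, 0.1]
k = [0.3, -0.2, 0.9, 0.5, -0.6, 0.1, 0.7, -0.4]
score_a = dot(apply_rope(q, rotary_angles(0.5)), apply_rope(k, rotary_angles(0.2)))
score_b = dot(apply_rope(q, rotary_angles(1.5)), apply_rope(k, rotary_angles(1.2)))
```

Under these assumptions, tokens from either modality that cover the same instant receive identical rotations, and the query-key score between an audio token and a video token depends only on how far apart they are in time, which is the property a cross-modal attention layer would need for synchronization.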
