DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

Haomin Zhang; Chang Liu; Junjie Zheng; Zihao Chen; Chaofan Ding; Xinhan Di

TL;DR

DeepAudio-V1 tackles the challenge of generating synchronized ambient audio and expressive speech from video and text by unifying video-to-audio, video-to-speech, and text-to-speech within an end-to-end framework. It presents a four-stage pipeline—V2A learning, TTS learning, dynamic Mixture of Modality Fusion, and V2S fine-tuning—along with energy-contour guidance to improve audiovisual synchronization. Empirical results demonstrate state-of-the-art or competitive performance across V2A, V2S, and TTS benchmarks, with notable gains in WER, speaker and emotion similarity, and spectral fidelity. The approach enables richer, context-aware audio-visual generation and sets the stage for robust multimodal dubbing and narration in real-world videos.

Abstract

High-quality, synchronized audio is currently synthesized by multi-modal joint learning frameworks that take video and optional text as input. On video-to-audio benchmarks, these frameworks achieve strong audio quality, semantic alignment, and audio-visual synchronization. In real-world scenarios, however, speech and audio often coexist in videos, and the end-to-end generation of synchronous speech and audio given video and text conditions is not well studied. Furthermore, the advantages of video-to-audio (V2A) models for generating speech from videos remain unclear. We therefore propose an end-to-end multi-modal generation framework that simultaneously produces speech and audio based on video and text conditions. The proposed framework, DeepAudio, consists of a video-to-audio (V2A) module, a text-to-speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module. In the evaluation, the proposed end-to-end framework achieves state-of-the-art performance on the video-audio, video-speech, and text-speech benchmarks. In detail, it achieves results comparable to state-of-the-art models on the video-audio and text-speech benchmarks and surpasses state-of-the-art models on the video-speech benchmark, improving WER from 16.57% to 3.15% (+80.99%), SPK-SIM from 78.30% to 89.38% (+14.15%), EMO-SIM from 66.24% to 75.56% (+14.07%), MCD from 8.59 to 7.98 (+7.10%), and MCD SL from 11.05 to 9.40 (+14.93%) across a variety of dubbing settings.
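The percentages in parentheses appear to be relative improvements over the baseline value, with the sign flipped for metrics where lower is better (WER, MCD, MCD SL). The following is a minimal sketch of that arithmetic, not code from the paper: the function name `relative_improvement` and the `higher_is_better` flags are assumptions made for illustration, and only the baseline and DeepAudio numbers are taken from the abstract.

```python
def relative_improvement(baseline: float, ours: float, higher_is_better: bool) -> float:
    """Return the relative gain over the baseline, in percent.

    For lower-is-better metrics the gain is the relative reduction,
    so an improvement is always reported as a positive number.
    """
    delta = (ours - baseline) if higher_is_better else (baseline - ours)
    return 100.0 * delta / baseline


# (baseline, DeepAudio, higher_is_better) as quoted in the abstract.
# The direction flags are an assumption based on how each metric is usually read.
REPORTED = {
    "WER": (16.57, 3.15, False),
    "SPK-SIM": (78.30, 89.38, True),
    "EMO-SIM": (66.24, 75.56, True),
    "MCD": (8.59, 7.98, False),
    "MCD SL": (11.05, 9.40, False),
}

gains = {
    name: relative_improvement(baseline, ours, higher_is_better)
    for name, (baseline, ours, higher_is_better) in REPORTED.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}%")
```

Under this reading, the WER figure is a relative reduction: (16.57 - 3.15) / 16.57 is roughly 0.81, which matches the reported +80.99%, while the absolute drop is 13.42 percentage points.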
