DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

Haomin Zhang, Chang Liu, Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di

TL;DR

DeepAudio-V1 tackles the challenge of generating synchronized ambient audio and expressive speech from video and text by unifying video-to-audio (V2A), video-to-speech (V2S), and text-to-speech (TTS) within one end-to-end framework. It presents a four-stage pipeline (V2A learning, TTS learning, dynamic Mixture of Modality Fusion, and V2S fine-tuning) along with energy-contour guidance to improve audio-visual synchronization. Empirical results show state-of-the-art or competitive performance across V2A, V2S, and TTS benchmarks, with notable gains in WER, speaker and emotion similarity, and spectral fidelity. The approach enables richer, context-aware audio-visual generation and sets the stage for robust multimodal dubbing and narration in real-world videos.

Abstract

Currently, high-quality, synchronized audio is synthesized using various multi-modal joint learning frameworks that leverage video and optional text inputs. On video-to-audio benchmarks, these frameworks achieve strong audio quality, semantic alignment, and audio-visual synchronization. However, in real-world scenarios, speech and audio often coexist in videos, and the end-to-end generation of synchronous speech and audio given video and text conditions is not well studied. Furthermore, the advantages of video-to-audio (V2A) models for generating speech from videos remain unclear. Therefore, we propose an end-to-end multi-modal generation framework that simultaneously produces speech and audio based on video and text conditions. The proposed framework, DeepAudio, consists of a video-to-audio (V2A) module, a text-to-speech (TTS) module, and a dynamic mixture of modality fusion (MoF) module. In the evaluation, the proposed end-to-end framework achieves state-of-the-art performance on the video-audio benchmark, video-speech benchmark, and text-speech benchmark. In detail, our framework achieves results comparable to state-of-the-art models on the video-audio and text-speech benchmarks, and surpasses state-of-the-art models on the video-speech benchmark, with WER 16.57% to 3.15% (+80.99%), SPK-SIM 78.30% to 89.38% (+14.15%), EMO-SIM 66.24% to 75.56% (+14.07%), MCD 8.59 to 7.98 (+7.10%), MCD SL 11.05 to 9.40 (+14.93%) across a variety of dubbing settings.
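The bracketed percentages in the abstract are consistent with a relative improvement measured against the baseline value. The sketch below reproduces that arithmetic from the figures quoted above. The helper name `relative_improvement` is hypothetical, and the direction of each metric is an assumption here: WER and the two MCD variants are treated as lower-is-better, SPK-SIM and EMO-SIM as higher-is-better.

```python
def relative_improvement(baseline, ours, lower_is_better):
    """Relative gain in percent, positive when `ours` beats `baseline`."""
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * delta / baseline


# (metric, baseline, ours, lower_is_better); values copied from the abstract.
# The lower/higher-is-better flags are assumptions made for this sketch.
REPORTED = [
    ("WER", 16.57, 3.15, True),
    ("SPK-SIM", 78.30, 89.38, False),
    ("EMO-SIM", 66.24, 75.56, False),
    ("MCD", 8.59, 7.98, True),
    ("MCD SL", 11.05, 9.40, True),
]

gains = {
    name: relative_improvement(baseline, ours, lower)
    for name, baseline, ours, lower in REPORTED
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}%")
```

Dividing by the baseline rather than by the new value is what makes the WER drop from 16.57% to 3.15% read as roughly 81% rather than as a 13.42-point absolute reduction.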

Paper Structure

This paper contains 19 sections, 12 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of our DeepAudio framework. Traditional methods separately address V2A and V2S tasks. Our proposed DeepAudio framework unifies these tasks by supporting both V2A and V2S generation through one unified model, allowing users to generate either synchronized ambient sounds or expressive speech based on different textual inputs.
  • Figure 2: Overview of the DeepAudio framework. The proposed DeepAudio framework unifies video-to-audio (V2A) and video-to-speech (V2S) generation in a multi-stage, end-to-end paradigm. The top section illustrates two independent generation paths: (1) the V2A module, which synthesizes ambient audio from video input using a CLIP-based multi-modal feature encoding and a noised latent representation, and (2) the TTS module, which generates speech conditioned on text and noised latent features. Both modules rely on codec decoders to reconstruct high-fidelity outputs. The bottom section presents the MoF module, an integrated multi-modal system that takes text, video, and instructions as inputs. A Gating Network adaptively fuses outputs from the V2A module and the TTS module, ensuring synchronized and context-aware audio-visual generation.
  • Figure 3: Mixture of Modality Fusion (MoF) framework. The MoF framework dynamically fuses multi-modal inputs (text, video, instructions, and transcripts) through a Gating Network, which adaptively routes features to the V2A and TTS modules. This enables flexible and context-aware audio-visual generation.
  • Figure 4: Video-to-Speech Tuning. The V2A-predicted energy contours and transcripts guide the TTS module to generate synchronized speech, ensuring improved alignment with video input through cross-modal conditioning.
  • Figure 5: Mel-spectrograms of ground-truth and synthesized audio samples from different methods under the V2C-Animation Dub 2.0 setting.
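
The captions for Figures 2 and 3 describe a Gating Network that adaptively fuses the outputs of the V2A and TTS modules. The page does not give the gate's formulation, so the following is only a minimal sketch of one common way to build such a gate: a softmax over per-module logits, followed by a weighted sum of the two outputs. Every name here (`gate_weights`, `fuse`, `gate_matrix`, the toy condition vector) is hypothetical and chosen for illustration; none of it is taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(condition, gate_matrix):
    """One logit per module: dot product of the condition with that module's row."""
    logits = [sum(w * x for w, x in zip(row, condition)) for row in gate_matrix]
    return softmax(logits)


def fuse(v2a_out, tts_out, weights):
    """Element-wise convex combination of the two module outputs."""
    w_v2a, w_tts = weights
    return [w_v2a * a + w_tts * t for a, t in zip(v2a_out, tts_out)]


# Toy condition vector (assumption): first entry high when the instruction
# asks for ambient sound, second entry high when it asks for speech.
condition = [1.0, 0.0]
gate_matrix = [[2.0, -2.0],   # row scoring the V2A module
               [-2.0, 2.0]]   # row scoring the TTS module

weights = gate_weights(condition, gate_matrix)
fused = fuse([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], weights)
print(weights, fused)
```

With this toy condition the gate puts most of its weight on the V2A output; swapping the condition to `[0.0, 1.0]` would favor the TTS output instead. A learned gate would replace the fixed `gate_matrix` with trained parameters and operate on latent features rather than short lists.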