Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi

TL;DR

Harmony tackles the challenging problem of robust audio-visual synchronization in open-source joint diffusion models. It introduces Cross-Task Synergy to stabilize alignment learning, a Global-Local Decoupled Interaction Module for separating timing and style, and Synchronization-Enhanced CFG to explicitly amplify alignment during inference. The approach yields state-of-the-art synchronization metrics while maintaining high video quality and audio fidelity across diverse sound types and visual styles. This work provides a practical and generalizable framework to produce tightly synchronized, multimodal content in accessible, open-source settings.

Abstract

Synthesizing synchronized audio-visual content is a key challenge in generative AI, and open-source models in particular struggle to achieve robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.

Paper Structure

This paper contains 32 sections, 10 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Harmony employs a cross-task synergy training strategy to achieve robust audio-visual synchronization. This versatile framework supports multiple generation paradigms, including joint audio-video synthesis as well as audio-driven and video-driven generation, while also demonstrating strong generalizability to diverse audio types (e.g., music) and visual styles.
  • Figure 2: (a) Mitigating Correspondence Drift with Cross-Task Synergy. Our training paradigm leverages a supervised audio- and video-driven task to provide a strong alignment signal. This instills robust synchronization features in the model, stabilizing the otherwise stochastic joint generation process. (b) Overview of the Harmony Model. The architecture features parallel branches for multimodal inputs. The video stream is conditioned on a reference image and a descriptive prompt. The audio stream is conditioned on a reference audio, an ambient sound description, and a speech transcript. The model then generates a single, synchronized audio-visual result.
  • Figure 3: Comparison of the audio-video alignment score among different training strategies.
  • Figure 4: SyncCFG uses muted audio and static video as negative anchors to isolate the synchronization feature, which effectively enhances audio-video alignment.
  • Figure 5: Qualitative comparison between Harmony and state-of-the-art methods, including Universe-1 [wang2025universe] and Ovi [low2025ovi].
  • ...and 8 more figures
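
The Figure 4 caption describes SyncCFG as guiding generation away from a "negative anchor" (muted audio for the video branch, static video for the audio branch). The sketch below illustrates that idea for the video branch only, under loud assumptions: it reuses the standard classifier-free guidance extrapolation, represents latents as plain lists of floats, and models the muted-audio anchor as an all-zero latent. The names `guide`, `sync_guided_video`, and `toy_denoiser` are hypothetical and are not taken from the paper, whose exact formulation is not given in this summary.

```python
from typing import Callable, List

Latent = List[float]


def guide(pred_cond: Latent, pred_anchor: Latent, scale: float) -> Latent:
    """Standard CFG-style extrapolation from an anchor prediction
    toward the conditioned prediction."""
    return [a + scale * (c - a) for c, a in zip(pred_cond, pred_anchor)]


def sync_guided_video(
    denoise_video: Callable[[Latent, Latent], Latent],
    video_latent: Latent,
    audio_latent: Latent,
    scale: float = 2.0,
) -> Latent:
    """Amplify the part of the video prediction that depends on the audio.

    Assumption: the negative anchor is the prediction obtained with a
    muted audio input, modeled here as an all-zero latent.
    """
    mute_audio = [0.0] * len(audio_latent)
    pred_with_audio = denoise_video(video_latent, audio_latent)
    pred_with_mute = denoise_video(video_latent, mute_audio)
    return guide(pred_with_audio, pred_with_mute, scale)


def toy_denoiser(video_latent: Latent, audio_latent: Latent) -> Latent:
    """Toy stand-in for a video denoiser: adds the mean audio level
    to a scaled copy of the video latent."""
    audio_level = sum(audio_latent) / len(audio_latent)
    return [0.5 * v + audio_level for v in video_latent]


if __name__ == "__main__":
    video = [1.0, 1.0, 1.0]
    audio = [2.0, 2.0]
    print(sync_guided_video(toy_denoiser, video, audio, scale=2.0))
```

With `scale=1.0` the result equals the ordinary audio-conditioned prediction; larger values push the output further along the difference between the audio-conditioned and muted predictions, which is the component attributable to the audio signal. A symmetric function for the audio branch would use a static-video anchor in place of the muted audio.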