Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu; Zhentao Yu; Guozhen Zhang; Zihan Su; Zhengguang Zhou; Youliang Zhang; Yuan Zhou; Qinglin Lu; Ran Yi

TL;DR

Harmony tackles the challenging problem of robust audio-visual synchronization in open-source joint diffusion models. It introduces Cross-Task Synergy to stabilize alignment learning, a Global-Local Decoupled Interaction Module for separating timing and style, and Synchronization-Enhanced CFG to explicitly amplify alignment during inference. The approach yields state-of-the-art synchronization metrics while maintaining high video quality and audio fidelity across diverse sound types and visual styles. This work provides a practical and generalizable framework to produce tightly synchronized, multimodal content in accessible, open-source settings.

Abstract

The synthesis of synchronized audio-visual content is a key challenge in generative AI, and open-source models in particular struggle to achieve robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental problems of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these problems, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state of the art, significantly outperforming existing methods in both generation fidelity and, critically, fine-grained audio-visual synchronization.
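For readers unfamiliar with the baseline mentioned in point (3), the following is a minimal sketch of conventional Classifier-Free Guidance only. It is not SyncCFG, whose formulation is not given in this abstract. The function name `cfg_combine` and the toy two-element predictions are illustrative assumptions, not taken from the paper.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Conventional CFG: extrapolate from the unconditional noise
    prediction toward the conditional one by a guidance scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values, for illustration only).
eps_uncond = [0.0, 1.0]  # prediction with the condition dropped
eps_cond = [1.0, 1.0]    # prediction with the condition present

guided = cfg_combine(eps_uncond, eps_cond, scale=3.0)
print(guided)  # [3.0, 1.0]
```

The guidance term amplifies whatever differs between the conditional and unconditional predictions. In this standard form nothing singles out cross-modal timing, which is the limitation the abstract attributes to conventional CFG.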
