TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
Tri Ton, Ji Woo Hong, Chang D. Yoo
TL;DR
TARO tackles video-to-audio synthesis by combining a flow-based transformer with two innovations: Timestep-Adaptive Representation Alignment (TRA), which modulates the strength of latent representation alignment along the noise schedule, and Onset-Aware Conditioning (OAC), which anchors audio generation to event-driven visual cues. By injecting pretrained audio priors through TRA and using onset cues to guide timing, TARO achieves superior fidelity and synchronization on VGGSound and Landscape while maintaining efficient inference. The method uses a convolution-based projection to address sequence-length mismatches and AdaLN for robust multimodal fusion, yielding state-of-the-art FD, FAD, and alignment metrics as well as strong perceptual quality (MOS-Q and MOS-A). Overall, TARO offers a practical approach to high-quality, temporally aligned video-to-audio synthesis, with real-time potential in Foley and multimedia workflows.
Abstract
This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues, sharp event-driven markers of audio-relevant visual moments, to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a relative reduction of 53% in Fréchet Distance (FD) and 29% in Fréchet Audio Distance (FAD), along with an Alignment Accuracy of 97.19%, highlighting its superior audio quality and synchronization precision.
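
To make the idea behind TRA concrete, the following is a minimal sketch of an alignment loss whose strength depends on the timestep. It is an illustration under stated assumptions, not the paper's implementation: the function names, the cosine-similarity objective, the linear weighting schedule, and the convention that t = 0 is pure noise and t = 1 is clean data are all hypothetical choices made here for clarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def alignment_weight(t, max_weight=1.0):
    """Hypothetical schedule: alignment strength as a function of timestep.

    Assumes t in [0, 1] with t = 0 meaning pure noise and t = 1 meaning
    clean data. The linear ramp is an assumption for illustration only;
    the abstract does not specify the actual schedule.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return max_weight * t


def timestep_adaptive_alignment_loss(hidden, target, t):
    """Weighted alignment loss between model and pretrained features.

    hidden: per-token features projected from the generator's latents.
    target: per-token features from a pretrained audio encoder.
    Both are lists of equal-length vectors with the same token count.
    """
    sims = [cosine_similarity(h, g) for h, g in zip(hidden, target)]
    mean_sim = sum(sims) / len(sims)
    # Loss is zero when features are perfectly aligned, scaled by the
    # timestep-dependent weight.
    return alignment_weight(t) * (1.0 - mean_sim)


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 1.0]]
    target = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal to hidden, token by token
    for t in (0.0, 0.5, 1.0):
        print(t, timestep_adaptive_alignment_loss(hidden, target, t))
```

In a training loop, a term of this kind would typically be added to the main flow-matching objective, so that the pretrained audio prior influences the latents more or less strongly depending on how noisy the current sample is.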
