TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis

Tri Ton, Ji Woo Hong, Chang D. Yoo

TL;DR

TARO tackles video-to-audio synthesis by integrating a flow-based transformer with two innovations: Timestep-Adaptive Representation Alignment (TRA), which modulates latent-audio alignment strength along the diffusion noise schedule, and Onset-Aware Conditioning (OAC), which anchors audio generation to event-driven visual cues. By injecting pretrained audio priors through TRA and using onset cues to guide timing, TARO achieves superior fidelity and synchronization on VGGSound and Landscape, while maintaining efficient inference. The method leverages a convolution-based projection to address sequence-length mismatches and employs AdaLN for robust multimodal fusion, resulting in state-of-the-art FD, FAD, and alignment metrics and strong perceptual quality (MOS-Q and MOS-A). Overall, TARO offers a practical and effective approach for high-quality, temporally aligned video-to-audio synthesis with real-time potential in Foley and multimedia workflows.

Abstract

This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues that serve as sharp event-driven markers of audio-relevant visual moments to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a relative 53% reduction in Fréchet Distance (FD), a 29% reduction in Fréchet Audio Distance (FAD), and an Alignment Accuracy of 97.19%, highlighting its superior audio quality and synchronization precision.

Paper Structure

This paper contains 19 sections, 8 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Our TARO leverages onset-aware conditioning to improve synchronization, aligning generated audio with event-driven cues in video.
  • Figure 2: Comparison of FD (y-axis) and inference speed (x-axis) across different models, with marker size representing parameter size.
  • Figure 3: Overview of TARO. TARO is a flow-based multimodal transformer for video-to-audio generation, integrating Timestep-Adaptive Representation Alignment (TRA) and Onset-Aware Conditioning (OAC) to enhance synchronization and fidelity. Black arrows $\rightarrow$ denote branches used only in training, blue arrows $\color{blue}\rightarrow$ denote branches used only in inference, and green arrows $\color{green}\rightarrow$ denote branches used in both training and inference.
  • Figure 4: One Multimodal Transformer Block. The block integrates onset cues, visual features, and latent audio representations through adaptive modulation and joint attention. $+$ denotes summation, and $\text{C}$ represents concatenation.
  • Figure 5: Quantitative comparisons of video-to-audio models. TARO achieves superior synchronization and fidelity, closely aligning with the ground truth, while other methods often produce misaligned or extraneous audio. This underscores TARO's effectiveness in capturing event-driven acoustic details.
  • ...and 5 more figures