StereoFoley: Object-Aware Stereo Audio Generation from Video

Tornike Karchkhadze, Kuan-Lin Chen, Mojtaba Heydari, Robert Henzel, Alessandro Toso, Mehrez Souden, Joshua Atkins

TL;DR

StereoFoley addresses the absence of object-aware stereo video-to-audio generation by combining a diffusion-based base model with a synthetic data pipeline that grounds sounds to visible objects in video. The approach yields semantically accurate, temporally aligned, and spatially consistent stereo audio at 48 kHz, with StereoFoley-obj achieving the strongest object–audio correspondence. The authors introduce a BAS metric and conduct a human MOS study, showing strong correlation between the objective measure and perceived stereo alignment. The work establishes the first end-to-end framework for stereo object-aware V2A and demonstrates competitive performance against state-of-the-art baselines, highlighting data and spatialization as key enablers.

Abstract

We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art results in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-audio correspondence. Since no established metrics exist, we introduce stereo object-awareness measures and validate them through a human listening study, showing strong correlation with perception. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.

Paper Structure

This paper contains 16 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: StereoFoley system overview: video, audio, and text encoders with a Diffusion-Transformer backbone.
  • Figure 2: Overview of the object-aware stereo data generation pipeline. (a) video scene analysis with LLM, (b) object detection and tracking with segmentation, (c) audio generation and synchronization with T2A and V2A models, and (d) stereo spatialization via dynamic panning and distance-based loudness, mixing with generated background sound, producing the final object-aware stereo mix for the video.
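
As a rough illustration of step (d) in Figure 2, the sketch below spatializes a mono object sound using a tracked horizontal position and a distance estimate. It is not the authors' implementation: the constant-power (sine/cosine) pan law, the inverse-distance gain, and the names `spatialize`, `xs`, `dists`, and `ref_dist` are all assumptions made for this example, since the summary does not specify how panning or loudness are computed.

```python
import math

def spatialize(mono, xs, dists, ref_dist=1.0):
    """Pan a mono signal to stereo, sample by sample.

    mono  : list of mono samples
    xs    : per-sample horizontal position in [0, 1] (0 = left edge, 1 = right edge)
    dists : per-sample object distance (same unit as ref_dist)
    All three lists are assumed to have the same length.
    """
    left, right = [], []
    for s, x, d in zip(mono, xs, dists):
        x = min(max(x, 0.0), 1.0)            # clamp position to the frame
        theta = x * math.pi / 2              # map position to a pan angle
        # Assumed loudness model: inverse-distance attenuation, capped at unity gain.
        gain = ref_dist / max(d, ref_dist)
        # Assumed pan law: constant power, so left^2 + right^2 = (s * gain)^2.
        left.append(s * gain * math.cos(theta))
        right.append(s * gain * math.sin(theta))
    return left, right

# An object moving from the left edge to the right edge while receding.
left, right = spatialize([1.0, 1.0, 1.0], [0.0, 0.5, 1.0], [1.0, 1.0, 2.0])
```

A constant-power law keeps perceived loudness roughly stable as the object crosses the frame, so that level changes come only from the distance term. In practice the per-frame track positions would need to be interpolated to the audio sample rate before a function like this is applied.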