StereoSync: Spatially-Aware Stereo Audio Generation from Video

Christian Marinoni, Riccardo Fosco Gramaccioni, Kazuki Shimada, Takashi Shibuya, Yuki Mitsufuji, Danilo Comminiello

TL;DR

StereoSync addresses the challenge of generating spatially-aware stereo audio from video by leveraging pretrained foundation models to extract depth and bounding-box cues, which condition a diffusion-based audio generator via cross-attention. The method combines global scene geometry, object motion, semantic embeddings, and temporal envelopes within a Stable Audio latent diffusion framework, training only a lightweight ControlNet and projection layers for efficiency. On Walking The Maps, StereoSync demonstrates robust temporal, semantic, and spatial alignment, achieving improved spatial coherence (Spatial AV-Align) while maintaining high audio quality (FAD, FAVD) and temporal accuracy (E-L1). This approach enables more immersive audiovisual experiences and reduces training overhead by reusing foundation models, with future work targeting binaural and multi-channel extensions to further enhance spatial realism.

Abstract

Although audio generation has been widely studied in recent years, video-aligned audio generation remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. StereoSync achieves its efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods, which primarily focus on temporal synchronization, StereoSync incorporates spatial awareness into video-aligned audio generation. Given an input video, our approach extracts spatial cues from depth maps and bounding boxes and uses them as cross-attention conditioning in a diffusion-based audio generation model. This allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and resulting in a significantly more immersive and realistic audio experience.

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: StereoSync generates stereo audio that reflects the spatial context of an input video.
  • Figure 2: StereoSync architecture: the proposed method extracts depth maps and bounding boxes from input videos using the video foundation models RollingDepth and MASA. Relevant features are then extracted to represent the spatiality of the scene. These features, along with a CLAP embedding of the audio sample that is expected to characterize the semantics of the final audio, are used to condition the audio synthesis model. Temporal control is provided through an envelope signal, which is used as the ControlNet input. The only weights trained in the architecture are those of the ControlNet and of the projection layers that map the conditioning embeddings to the shape required by Stable Audio, making our model lightweight and efficient.
  • Figure 3: Examples from the Walking The Maps dataset showing the bounding box of the subjects and their corresponding movement represented as trajectories of the center of the bounding box.
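
The captions above mention two simple signals: the trajectory of the bounding-box center (Figure 3) and an envelope signal used for temporal control (Figure 2). The sketch below illustrates how such signals could be computed with NumPy. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the `[x_min, y_min, x_max, y_max]` box layout, the choice of an RMS envelope, and the `frame_len` and `hop` values are all hypothetical, since the text here does not specify them.

```python
import numpy as np


def bbox_center_trajectory(boxes):
    """Return the (T, 2) trajectory of bounding-box centers.

    Assumes `boxes` has shape (T, 4) with one box per video frame,
    laid out as [x_min, y_min, x_max, y_max] (an assumed convention).
    """
    boxes = np.asarray(boxes, dtype=float)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0  # horizontal center per frame
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0  # vertical center per frame
    return np.stack([cx, cy], axis=1)


def rms_envelope(audio, frame_len=512, hop=256):
    """Return a frame-wise RMS envelope of a mono, non-empty waveform.

    RMS is one common envelope definition; it is an assumption here,
    as are the default frame length and hop size.
    """
    audio = np.asarray(audio, dtype=float)
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))
    return env


# Toy usage: a box moving to the right, and a constant-amplitude signal.
centers = bbox_center_trajectory([[0, 0, 2, 2], [2, 0, 4, 2]])
envelope = rms_envelope(np.ones(1024))
```

In this toy example the horizontal center moves from 1 to 3 while the vertical center stays at 1, which is the kind of left-to-right motion a spatially-aware stereo generator would be expected to follow. How these signals are actually encoded and projected before conditioning the model is not covered by this sketch.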