SoundReactor: Frame-level Online Video-to-Audio Generation

Koichi Saito, Julian Tanke, Christian Simon, Masato Ishii, Kazuki Shimada, Zachary Novack, Zhi Zhong, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

TL;DR

This work introduces frame-level online video-to-audio (V2A) generation, a setting where audio is generated autoregressively from video without access to future frames. It proposes SoundReactor, a simple yet effective framework with three components: video token modeling via DINOv2 grid features augmented by temporal differences, audio token modeling via a continuous stereo VAE, and a decoder-only multimodal transformer with a diffusion head that uses classifier-free guidance. Training combines diffusion pretraining with Easy Consistency Tuning (ECT) to accelerate diffusion-head decoding. The resulting per-frame waveform-level latency is 26.3 ms with NFE=1 and 31.5 ms with NFE=4 on 30 FPS, 480p video using a single H100, while the model maintains semantic and temporal alignment and high audio quality on AAA gameplay data. Objective and human evaluations support these results, which point to applications in interactive multimodal world models and live content creation; the ablations offer practical design insights.
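The video token modeling step can be illustrated with a minimal sketch. The summary states only that per-frame grid features are augmented by temporal differences and reduced to one token per frame; the concatenation layout, the mean pooling over patches, and the zero difference for the first frame are assumptions made here for illustration, and the names `frame_token` and `video_tokens` are hypothetical rather than taken from the paper.

```python
from typing import List, Optional

Grid = List[List[float]]  # one feature vector per image patch


def frame_token(grid: Grid, prev_grid: Optional[Grid]) -> List[float]:
    """Collapse one frame's patch features into a single vector.

    Output layout (an assumption): [mean patch feature, mean temporal difference].
    """
    if prev_grid is None:
        # First frame: no past frame exists, so the difference is zero (assumption).
        prev_grid = grid
    dim = len(grid[0])
    token = [0.0] * (2 * dim)
    for patch, prev_patch in zip(grid, prev_grid):
        for d in range(dim):
            token[d] += patch[d]                        # appearance term
            token[dim + d] += patch[d] - prev_patch[d]  # temporal difference term
    n_patches = len(grid)
    return [value / n_patches for value in token]       # mean pooling (assumption)


def video_tokens(frames: List[Grid]) -> List[List[float]]:
    """Produce one token per frame using only the current and previous frame."""
    tokens: List[List[float]] = []
    prev: Optional[Grid] = None
    for grid in frames:
        tokens.append(frame_token(grid, prev))
        prev = grid
    return tokens
```

Because each token depends only on the current and the previous frame, appending later frames never changes tokens that were already emitted, which is the causality property the online setting requires.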

Abstract

Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained through diffusion pre-training followed by consistency fine-tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per-frame waveform-level latency (26.3 ms with the head NFE=1, 31.5 ms with NFE=4) on 30 FPS, 480p videos using a single H100. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.
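To put the reported latencies in context, the short calculation below compares them with the time available per frame at 30 FPS. It assumes that "real time" means the per-frame latency stays below the frame interval; that criterion and the helper names `frame_budget_ms` and `headroom_ms` are illustrative choices, not definitions from the paper. The latency values are the ones quoted in the abstract.

```python
def frame_budget_ms(fps: float) -> float:
    """Time between consecutive video frames, in milliseconds."""
    return 1000.0 / fps


def headroom_ms(latency_ms: float, fps: float) -> float:
    """Slack left in one frame interval after generating that frame's audio."""
    return frame_budget_ms(fps) - latency_ms


# Per-frame waveform-level latencies from the abstract, keyed by head NFE.
REPORTED_LATENCY_MS = {1: 26.3, 4: 31.5}

for nfe, latency in REPORTED_LATENCY_MS.items():
    budget = frame_budget_ms(30.0)
    slack = headroom_ms(latency, 30.0)
    print(f"NFE={nfe}: {latency:.1f} ms of a {budget:.1f} ms budget, headroom {slack:.1f} ms")
```

Under this assumption both settings fit inside the roughly 33.3 ms frame interval, with about 7.0 ms to spare at NFE=1 and about 1.8 ms at NFE=4.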

Paper Structure

This paper contains 67 sections, 10 equations, 9 figures, 10 tables, 3 algorithms.

Figures (9)

  • Figure 1: Our scope is the frame-level online video-to-audio (V2A) generation task, where future video frames are not available in advance. This contrasts with the conventional offline V2A task, where an entire video sequence or a chunk of frames is available in advance.
  • Figure 2: Overview of SoundReactor. Our framework has three components: (a) Video token modeling, (b) Audio token modeling, and (c) Multimodal AR transformer with diffusion head.
  • Figure 3: Spectrograms of long-sequence generation (twice the training window) with SoundReactor-ECT (NFE=4), comparing Sliding Window Attention (SWA), Position Interpolation (PI), NTK-aware Interpolation (NTK), and the ground truth (GT). PI results in a slower cadence for periodic sound.
  • Figure 4: Ablation study on ECT. $\texttt{CF}$ and $\texttt{IN}$ indicate different mapping functions (see the appendix on training details). $0.1$ and $0.2$ denote dropout rates (Srivastava et al., 2014) during fine-tuning. $\texttt{CF}$--$0.2$ is our default.
  • Figure 5: Visualization of DINOv2 grid features and cosine similarity of their [CLS]-tokens between (A) adjacent frames and (B) every other frame. Visualizations are done by applying PCA to grid features and mapping them to RGB.
  • ...and 4 more figures