AudioScenic: Audio-Driven Video Scene Editing

Kaixin Shen, Ruijie Quan, Linchao Zhu, Jun Xiao, Yi Yang

TL;DR

AudioScenic is an audio-driven framework for video scene editing: it changes the background of a video according to an audio clip while preserving the foreground content. It injects audio semantics through a temporal-aware audio semantic injection method (TASI) and adds three modules: SceneMasker to protect the foreground, a Magnitude Modulator to control temporal dynamics, and a Frequency Fuser to improve temporal coherence. A new temporal score metric assesses temporal consistency, and experiments on DAVIS and AudioSet show substantial improvements over text-driven and audio-driven baselines. The approach enables diverse, audio-synchronized video edits and builds on a Stable Diffusion-based latent diffusion framework.

Abstract

Audio-driven visual scene editing aims to manipulate the visual background according to given audio signals while leaving the foreground content unchanged. Current efforts focus primarily on image editing, and audio-driven video scene editing has not been extensively addressed. In this paper, we introduce AudioScenic, an audio-driven framework designed for video scene editing. AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process. As our focus is on background editing, we further introduce a SceneMasker module, which maintains the integrity of the foreground content during the editing process. AudioScenic exploits two inherent properties of audio, magnitude and frequency, to guide the editing process, aiming to control the temporal dynamics and enhance the temporal consistency. First, we present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude, enhancing the visual dynamics. Second, the audio Frequency Fuser module is designed to ensure temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes, thus improving the overall temporal coherence of the edited videos. Together, these features enable AudioScenic to enhance visual diversity while maintaining temporal consistency throughout the video. We present a new metric, named temporal score, for more comprehensive validation of temporal consistency. We demonstrate substantial advancements of AudioScenic over competing methods on the DAVIS and AudioSet datasets.
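To make the two audio properties named in the abstract concrete, the following is a minimal sketch of how a per-frame magnitude and a per-frame dominant frequency could be computed from a mono waveform. It is an illustration under stated assumptions, not the paper's feature extractor: the function name `frame_audio_features`, the use of RMS as the magnitude, and the use of the strongest DFT bin as the frequency are all hypothetical choices made here.

```python
import cmath
import math


def frame_audio_features(samples, sample_rate, num_frames):
    """Split a mono waveform into num_frames equal windows (one per video frame).

    Returns two lists of length num_frames:
      magnitudes  - RMS amplitude of each window (assumed stand-in for "magnitude")
      frequencies - frequency in Hz of the strongest non-DC DFT bin
                    (assumed stand-in for "frequency")
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    window = len(samples) // num_frames
    if window < 2:
        raise ValueError("not enough samples per frame")

    magnitudes, frequencies = [], []
    for i in range(num_frames):
        chunk = samples[i * window:(i + 1) * window]

        # Root-mean-square amplitude of the window.
        magnitudes.append(math.sqrt(sum(x * x for x in chunk) / window))

        # Naive DFT over the positive-frequency bins, skipping the DC bin.
        best_bin, best_power = 0, 0.0
        for k in range(1, window // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / window)
                for n, x in enumerate(chunk)
            )
            power = abs(coeff) ** 2
            if power > best_power:
                best_bin, best_power = k, power

        # Convert the bin index to Hz; a silent window yields 0.0.
        frequencies.append(best_bin * sample_rate / window)

    return magnitudes, frequencies
```

A louder window gives a larger magnitude value and a higher-pitched window gives a larger frequency value, which is the kind of per-frame signal a magnitude- or frequency-conditioned module could consume. The naive DFT costs O(n²) per window; a real pipeline would more likely use an FFT library.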

Paper Structure (17 sections, 9 equations, 8 figures, 2 tables)

This paper contains 17 sections, 9 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Editing results of AudioScenic. Given a source video (top row), our approach can perform scene editing using various audio clips while preserving the foreground content. Moreover, we can conduct scene-style transitions with emotional audio clips such as happy or sad music.
  • Figure 2: Framework of AudioScenic (see the framework section). In the fine-tuning stage, given a source video with an audio clip, we invert the video to a noisy latent using DDIM inversion. We obtain a semantic embedding, a magnitude feature, and a frequency feature from the audio clip. We fuse the audio semantic embedding with the timestep embedding derived from timestep $t$ to guide the latent denoising process. The audio magnitude feature $\bm{M}_a$ and frequency feature $\bm{F}_a$ are used by the Magnitude Modulator and the Frequency Fuser, respectively. We compute the reconstruction loss for fine-tuning. In the inference stage, we input a new audio clip for guidance. The depth and binary masks are additionally used to preserve the foreground content.
  • Figure 3: Comparison with baselines through magnitude control (see the baseline comparison section). The semantic label of the input audio is "Cracking fire".
  • Figure 4: Qualitative comparison with Sound-G, an audio-driven image editing method (see the baseline comparison section). The semantic label of the input audio is "Sea wave".
  • Figure 5: Qualitative comparison (see the baseline comparison section). We compare results between (a-c) the text-driven methods Tune-A-Video, FateZero, and Video-P2P and (d) ours. For (a-c), the input text is "a jeep car is driving beside the sea wave". For ours, the semantic label of the audio is "Sea wave".
  • ...and 3 more figures
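
The Figure 2 caption notes that binary masks are used to preserve the foreground content. One common way to apply such a mask is per-pixel compositing, sketched below. This is an assumption made for illustration rather than the paper's SceneMasker: the function name `composite_with_mask` is hypothetical, frames are plain nested lists of floats, and the caption does not say whether masking happens in pixel or latent space.

```python
def composite_with_mask(source, edited, foreground_mask):
    """Blend two frames of identical shape using a binary foreground mask.

    Where the mask is 1 the source (foreground) value is kept;
    where the mask is 0 the edited (background) value is used.
    """
    if not (len(source) == len(edited) == len(foreground_mask)):
        raise ValueError("frames and mask must have the same number of rows")

    output = []
    for src_row, edit_row, mask_row in zip(source, edited, foreground_mask):
        if not (len(src_row) == len(edit_row) == len(mask_row)):
            raise ValueError("rows must have the same number of columns")
        output.append([
            m * s + (1 - m) * e
            for s, e, m in zip(src_row, edit_row, mask_row)
        ])
    return output
```

Applying the same compositing to every frame keeps foreground pixels identical to the source video no matter how strongly the background is edited.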