Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, Lijuan Wang

TL;DR

The paper tackles zero-shot audio-video editing by introducing AvED, a cross-modal delta denoising framework that jointly edits audio and video using cross-modal attention and a contrastive loss. It defines AvED-Bench, a challenging benchmark of 110 VGGSound-based videos with prompts, and demonstrates strong improvements over state-of-the-art baselines on both AvED-Bench and the OAVE dataset, highlighting better coherence, synchronization, and perceptual fidelity. The key contributions are the cross-modal delta denoising scheme, the formulation of prompt-relevant patch sampling with a cross-modal contrastive loss, and the extensive evaluation showing substantial gains in AV alignment and visual/audio quality. The work underscores the importance of joint cross-modal supervision for realistic editing of multimedia content without additional training, with practical implications for content creation and multimodal video production.

Abstract

In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED achieves superior results on both AvED-Bench and the recent OAVE dataset, which validates its generalization capabilities. Results are available at https://genjib.github.io/project_page/AVED/index.html

Paper Structure

This paper contains 31 sections, 7 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Key Challenges in Joint Audio-Video Editing. Existing methods primarily focus on zero-shot text-to-video [cohen2024slicedit, yang2023rerender_a_video, liu2024video_p2p] or text-to-audio [zs_audio_ddpm, wang2023audit, audioeditor] editing separately. Editing only video or only audio often leads to coherence and synchronization issues between the two modalities. As highlighted by the red circle, the motion or presence of sounding objects may not align with the corresponding audio. Additionally, edited content may exhibit audio artifacts along the temporal dimension (shown in the purple squares). These factors make the edited results feel less natural and cohesive. In contrast, our AvED jointly edits audio and video, leveraging cross-modal information as additional supervision to improve editing quality and alleviate synchronization issues.
  • Figure 2: Our AvED Framework. AvED performs zero-shot audio-video editing by employing a cross-modal delta denoising score scheme to jointly edit audio and video based on target prompts. During the denoising process, relevance scores are computed between audio/image regions and target textual prompts within the cross-attention module of the diffusion model. These scores identify prompt-relevant regions (i.e., blue areas) and irrelevant patches, allowing selective editing of specific regions while preserving unaltered content. Using this region information (obtained by randomly sampling patch indices), we define two kinds of positive pairs: unaltered content that is consistent in both the source and target branches, and regions requiring edits across the audio and video modalities. All other pairs are treated as negative pairs. This design enables synchronized, high-fidelity edits aligned with target prompts, maintaining coherence across audio and video.
  • Figure 3: Human Evaluation. Human raters evaluate edited audio and video quality based on alignment with the target text prompt. We report the average human preference rate for each method. All samples are presented in a random order to ensure unbiased assessment.
  • Figure 4: Qualitative Zero-Shot Audio-Video Editing Results. We present qualitative results of audio-video editing for a video depicting a transition from "Cat" to "Dog." AvED is compared with video models, including ControlVideo [zhang2023controlvideo], TokenFlow [tokenflow], and RAVE [rave], along with the audio model ZEUS [zs_audio_ddpm]. The green circles highlight well-aligned motion matches in the video frames, while the black rectangles emphasize precise audio matching. The blue rectangles indicate audio artifacts in the competing models, which lead to misalignment between video actions and audio output.
  • Figure 5: Category Distribution of AvED-Bench. We present the source and target category distribution of the AvED-Bench dataset. The source categories are the original categories of the videos, while the target categories are the categories they are edited into. This distribution highlights AvED-Bench's capability to effectively evaluate a variety of audio-video editing scenarios.
  • ...and 3 more figures
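
The Figure 2 caption describes sampling prompt-relevant patches from cross-attention relevance scores and then contrasting positive pairs against negative pairs. A minimal sketch of that idea is given below, assuming NumPy arrays of per-patch features and relevance scores. The function names (`select_patches`, `contrastive_loss`), the 0.5 threshold, the 0.1 temperature, the mean pooling of edited regions, and the InfoNCE form of the loss are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def select_patches(relevance, num_samples, threshold=0.5, rng=None):
    """Randomly sample indices of prompt-relevant and irrelevant patches.

    `relevance` is a 1-D array of per-patch scores (assumed to lie in [0, 1]).
    The threshold is a hypothetical choice for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    relevant = np.flatnonzero(relevance >= threshold)
    irrelevant = np.flatnonzero(relevance < threshold)

    def pick(idx):
        return rng.choice(idx, size=min(num_samples, idx.size), replace=False)

    return pick(relevant), pick(irrelevant)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, and (N, D) negatives."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    anchor, positive, negatives = unit(anchor), unit(positive), unit(negatives)
    # Cosine similarities: the positive goes first, the negatives follow.
    logits = np.concatenate(([anchor @ positive], negatives @ anchor)) / temperature
    logits -= logits.max()  # numerical stability
    # -log softmax probability assigned to the positive pair.
    return float(np.log(np.exp(logits).sum()) - logits[0])


# Toy usage with random features (shapes are arbitrary placeholders).
rng = np.random.default_rng(0)
video_feats = rng.normal(size=(16, 8))    # 16 video patches, 8-dim features
audio_feats = rng.normal(size=(12, 8))    # 12 audio patches, 8-dim features
video_rel = np.linspace(0.0, 1.0, 16)     # stand-in for cross-attention relevance
audio_rel = np.linspace(0.0, 1.0, 12)

v_edit, _ = select_patches(video_rel, 4, rng=rng)
a_edit, a_keep = select_patches(audio_rel, 4, rng=rng)

# Edited video region vs. edited audio region form a cross-modal positive pair;
# unaltered audio patches serve as negatives in this sketch.
loss = contrastive_loss(
    video_feats[v_edit].mean(axis=0),
    audio_feats[a_edit].mean(axis=0),
    audio_feats[a_keep],
)
```

The same `contrastive_loss` call could, under the same assumptions, be applied to the other kind of positive pair named in the caption by pairing an unaltered patch in the source branch with the same patch in the target branch.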