Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation

Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

TL;DR

The paper proposes a diffusion-based framework for joint audio-visual editing, together with a cross-modal semantic enhancement approach that strengthens semantic consistency between language and vision to mitigate catastrophic neglect during content editing.

Abstract

In this paper, we introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance. For instance, we can alter the background environment of a sounding object while keeping its appearance unchanged, or we can add new sounds contextualized to the visual content. To address this task, we propose a new diffusion-based framework for joint audio-visual editing and introduce two key ideas. Firstly, we propose a one-shot adaptation approach to tailor generative diffusion models for audio-visual content editing. With as few as one audio-visual sample, we jointly transfer the audio and vision diffusion models to the target domain. After fine-tuning, our model enables consistent generation of this audio-visual sample. Secondly, we introduce a cross-modal semantic enhancement approach. We observe that when using language as content editing guidance, the vision branch may overlook editing requirements. This phenomenon, termed catastrophic neglect, hampers audio-visual alignment during content editing. We therefore enhance semantic consistency between language and vision to mitigate this issue. Extensive experiments validate the effectiveness of our method in language-based audio-visual editing and highlight its superiority over several baseline approaches. We recommend that readers visit our project page for more details: https://liangsusan-git.github.io/project/avedit/.
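
The cross-modal semantic enhancement idea can be illustrated with a toy example. The sketch below assumes that the enhancement amounts to up-weighting the attention an image patch pays to the edit-related prompt tokens and then renormalizing. The exact rule used in the paper is not given here, so the function name `reweight_attention`, the `scale` factor, and the token indices are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_attention(attn_row, edit_token_ids, scale=2.0):
    """Boost the attention mass on edit tokens, then renormalize to sum to 1.

    ASSUMPTION: a simple multiplicative boost; the paper's actual
    reweighting scheme may differ.
    """
    boosted = [
        w * scale if i in edit_token_ids else w
        for i, w in enumerate(attn_row)
    ]
    total = sum(boosted)
    return [w / total for w in boosted]


# One image patch attending over four prompt tokens:
# index 0 "bird", 1 "chirping", 2 "under", 3 "water".
# The raw scores mimic neglect: the edit tokens (2, 3) get little attention.
scores = [2.0, 1.5, 0.1, 0.2]
attn = softmax(scores)
enhanced = reweight_attention(attn, edit_token_ids={2, 3}, scale=3.0)

for name, before, after in zip(["bird", "chirping", "under", "water"], attn, enhanced):
    print(f"{name:>8}: {before:.3f} -> {after:.3f}")
```

In this toy setting the weights on "under" and "water" grow while the weights on the original subject shrink, which is the qualitative effect the abstract describes: the vision branch is pushed to attend to the editing requirement instead of overlooking it.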

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: We propose a novel language-guided joint audio-visual editing approach that allows users to edit their own sounding objects conditioned on various instructions. It only requires as few as one audio-visual pair for adaptation and enables the generation of new audio-visual instances based on creative text prompts. For example, after we update diffusion models with the user-provided "a bird is chirping" data, we can easily generate the image of a bird chirping under water and synthesize the same chirping sound mixed with the audio of water bubbles and wave crashing, conditioned on the free-form text prompt: "a bird is chirping under water."
  • Figure 2: Our framework for language-guided audio-visual editing. During training, we extract unimodal information from the audio-visual sample using pretrained encoders. Then, we fuse audio and visual features with an MLP and feed the output along with the text prompt into the text encoder. The text encoder generates textual conditions to guide the audio-visual diffusion model. We update the parameters of the MLP and diffusion models. During inference, we freeze all parameters of our model. We replace the training prompt with an editing prompt, e.g., we append "beside a crackling fireplace" to the training prompt "a telephone is ringing." We inject the cross-modal semantic enhancement module into the vision branch to improve semantic consistency. The generated audio and image accurately reflect the editing requirements.
  • Figure 3: Multimodal one-shot adaptation. We extract meaningful audio-visual representations from user-provided data. We incorporate representations into textual embeddings and feed them to the text model to generate multimodal conditions.
  • Figure 4: Cross-modal semantic enhancement. The vision model tends to neglect the editing requirements while the audio model can accurately generate targeted content. We adjust the weights of vision-language attention maps to mitigate this issue. Eventually, we achieve consistent audio-visual content editing conditioned on language.
  • Figure 5: We show some samples from the OAVE dataset, including animals, vehicles, tools, natural phenomena, musical instruments, and human speech.
  • ...and 2 more figures
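
The fusion step described in the captions for Figures 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a two-layer MLP with a ReLU, assumes the fused vector is appended to the prompt's token embeddings as one extra token, and uses small made-up dimensions and random features in place of real pretrained encoders. All names (`fuse`, `init_linear`, `AUDIO_DIM`, and so on) are hypothetical.

```python
import random


def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero biases (placeholder initialization)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def fuse(audio_feat, visual_feat, params):
    """Concatenate audio and visual features and map them into text-embedding space."""
    (w1, b1), (w2, b2) = params
    x = audio_feat + visual_feat  # list concatenation
    return linear(relu(linear(x, w1, b1)), w2, b2)


rng = random.Random(0)
AUDIO_DIM, VISUAL_DIM, HIDDEN_DIM, TEXT_DIM = 4, 6, 8, 5  # toy sizes (assumed)
params = (
    init_linear(AUDIO_DIM + VISUAL_DIM, HIDDEN_DIM, rng),
    init_linear(HIDDEN_DIM, TEXT_DIM, rng),
)

# Stand-ins for features from pretrained audio and image encoders.
audio_feat = [rng.random() for _ in range(AUDIO_DIM)]
visual_feat = [rng.random() for _ in range(VISUAL_DIM)]

# Stand-in for the token embeddings of a three-token training prompt.
prompt_embeddings = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(3)]

# ASSUMPTION: the fused representation enters the text encoder as an extra token.
fused_token = fuse(audio_feat, visual_feat, params)
conditioned = prompt_embeddings + [fused_token]
print(len(conditioned), "tokens of dimension", len(conditioned[0]))
```

In the framework as captioned, the MLP and the diffusion models are updated during one-shot adaptation and frozen at inference. The sketch covers only the forward pass that produces the multimodal condition.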