Object-AVEdit: An Object-level Audio-Visual Editing Model

Youquan Fu, Ruiyang Si, Hongfa Wang, Dongzhan Zhou, Jiacheng Sun, Ping Luo, Di Hu, Hongyuan Zhang, Xuelong Li

TL;DR

This work tackles object-level editing across audio and video by enabling addition, replacement, and removal of objects while preserving structural consistency. It proposes Object-AVEdit, which combines a word-to-object aligned audio generation model with an inversion-regeneration holistically-optimized editing algorithm that uses attention-control to preserve structure and enhance realism. The audio generator ties word-level text embeddings to sounding objects and is trained with a VAE/DiT/vocoder stack and a Flow Matching scheduler, while the editing pipeline performs repeated inversions and mid-step velocity-based regeneration to improve fidelity. Extensive experiments on dedicated audio-visual editing benchmarks show superior cross-modal semantic alignment and editing quality compared to baselines, along with strong audio generation performance. The approach promises practical impact for film and video post-production by enabling intuitive, precise object-level edits across both modalities.

Abstract

Audio-visual editing is in high demand in video post-production and filmmaking. While numerous models have explored audio and video editing, they struggle with object-level audio-visual operations. Specifically, object-level audio-visual editing requires the ability to add, replace, and remove objects across both audio and visual modalities while preserving the structural information of the source instances during editing. In this paper, we present Object-AVEdit, which achieves object-level audio-visual editing based on the inversion-regeneration paradigm. To achieve object-level controllability during editing, we develop an audio generation model with strong word-to-sounding-object alignment, bridging the gap in object controllability between audio and current video generation models. Meanwhile, to better preserve structural information and improve the object-level editing effect, we propose an inversion-regeneration holistically-optimized editing algorithm that ensures both information retention during inversion and better regeneration quality. Extensive experiments demonstrate that our editing model achieves advanced results on both audio and video object-level editing tasks with fine audio-visual semantic alignment. In addition, our audio generation model also achieves advanced performance. More results are available on our project page: https://gewu-lab.github.io/Object_AVEdit-website/.

Paper Structure

This paper contains 20 sections, 12 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: The Object-AVEdit model provides object-level editing capability on audio-visual data. Users can perform object-level editing operations such as (a) object addition, (b) object removal, and (c, d) object replacement on audio-visual pairs with Object-AVEdit.
  • Figure 2: Editing pipeline of Object-AVEdit. Object-AVEdit edits audio-visual data by first inverting the original video and audio into noise. The target prompt is then used to regenerate the semantically aligned edited video and audio while preserving the original structure. In the regeneration process, our audio generation model is used to ensure access to object-level attention maps, and the inversion-regeneration holistically-optimized editing algorithm is applied to ensure both structural information preservation during inversion and high regeneration quality.
  • Figure 3: Performance of different audio editing methods on the addition, replacement, and removal tasks. The prompts of the original audios and the desired edited audios are: (a) Dog bark. $\rightarrow$ Dog bark with raining. (b) Dog. $\rightarrow$ Pig. (c) Lion roar with raining. $\rightarrow$ Lion roar. As the Mel spectrograms show, Object-AVEdit successfully edits the audio while preserving the structural information, demonstrating significant superiority.
  • Figure 4: Performance of different video editing methods on the addition, replacement, and removal tasks. The prompts of the original videos and the desired edited videos are: (a) A brindle dog standing on dry grass with a gray road above. $\rightarrow$ A brindle dog on dry grass with a gray road above in the rain. (b) A cat in the classroom. $\rightarrow$ A dog in the classroom. (c) A yellow dog on gray floor tiles beside a white cabinet $\rightarrow$ gray floor tiles beside a white cabinet. As the videos show, Object-AVEdit achieves advanced results on the object-level video editing tasks.
  • Figure 5: Effectiveness of Object-AVEdit on diverse examples.
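
The inversion-regeneration paradigm described in the Figure 2 caption can be illustrated with a minimal sketch. The code below is not the authors' implementation: `toy_velocity` is a hypothetical hand-written velocity field standing in for a learned flow-matching model, the conditions are plain vectors rather than text prompts, and plain Euler steps are assumed in place of the paper's holistically-optimized algorithm. It only shows the general idea: integrate the data toward noise under the source condition, then integrate back under the target condition.

```python
import math
import random


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a learned velocity network v(x, t, cond)."""
    return [xi - ci for xi, ci in zip(x, cond)]


def integrate(x, cond, t_start, t_end, num_steps=100):
    """Explicit Euler integration of dx/dt = v(x, t, cond) from t_start to t_end."""
    h = (t_end - t_start) / num_steps  # negative when integrating backward in time
    t = t_start
    x = list(x)
    for _ in range(num_steps):
        v = toy_velocity(x, t, cond)
        x = [xi + h * vi for xi, vi in zip(x, v)]
        t += h
    return x


def invert(x_data, source_cond, num_steps=100):
    """Map data (t = 0) to a noise-like latent (t = 1) under the source condition."""
    return integrate(x_data, source_cond, 0.0, 1.0, num_steps)


def regenerate(x_latent, target_cond, num_steps=100):
    """Map the latent (t = 1) back to data space (t = 0) under a given condition."""
    return integrate(x_latent, target_cond, 1.0, 0.0, num_steps)


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Assumed toy setup: 4-dimensional "signals" and vector-valued conditions.
rng = random.Random(0)
source_cond = [0.0] * 4
target_cond = [1.0] * 4
x_data = [0.1 * rng.gauss(0.0, 1.0) for _ in range(4)]

latent = invert(x_data, source_cond)
reconstruction = regenerate(latent, source_cond)  # same condition: near-identity
edited = regenerate(latent, target_cond)          # new condition: shifted output

recon_error = distance(reconstruction, x_data)
print("reconstruction error:", recon_error)
print("distance to target before edit:", distance(x_data, target_cond))
print("distance to target after edit:", distance(edited, target_cond))
```

Because Euler steps are not exactly reversible, the round trip under the source condition leaves a small reconstruction error that shrinks as `num_steps` grows. This is a toy analogue of the information loss during inversion that the abstract says the editing algorithm is designed to limit.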