O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing

Yuqing Chen, Junjie Wang, Lin Liu, Ruihang Chu, Xiaopeng Zhang, Qi Tian, Yujiu Yang

TL;DR

The paper tackles the difficulty of controllable video editing with diffusion models by introducing O-DisCo-Edit, a unified framework that uses a noise-based object distortion signal (O-DisCo) to encompass diverse editing cues. It pairs O-DisCo with a Copy-Form Preservation module and an Identity Preservation module to maintain unedited regions and object identity, respectively. A training-time random distorter (R-O-DisCo) and an inference-time adaptive distorter (A-O-DisCo) enable multi-granularity control during editing. Through extensive experiments across eight tasks and thorough ablations, the approach achieves state-of-the-art results on most benchmarks and demonstrates improved efficiency over prior multi-task and specialized models. This work proposes a new paradigm where a single unified control signal can drive flexible, high-fidelity video editing with reduced resource demands.

Abstract

Diffusion models have recently advanced video editing, yet controllable editing remains challenging due to the need for precise manipulation of diverse object properties. Current methods require a different control signal for each editing task, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo). This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. Paired with a "copy-form" preservation module for preserving non-edited regions, O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm. Extensive experiments and comprehensive human evaluations consistently demonstrate that O-DisCo-Edit surpasses both specialized and multitask state-of-the-art methods across various video editing tasks. https://cyqii.github.io/O-DisCo-Edit.github.io/

Paper Structure

This paper contains 24 sections, 5 equations, 16 figures, 8 tables, 2 algorithms.

Figures (16)

  • Figure 1: Given a reference video and image (typically the edited first frame), our method generates more realistic edited videos than SOTA approaches (VACE, Senorita and VideoPainter) across various tasks, including object removal, swap, object inside motion transfer, and style transfer. Zoom in to examine the visualization results. The bottom right of the reference video shows the input masks for all models, while the bottom right of our result displays our proposed novel control signal.
  • Figure 2: Comparisons of different object properties, control signals, and models.
  • Figure 3: The framework of the proposed O-DisCo-Edit. (a) Reference video. (b) Reference image (first frame during training, edited image during inference). (c) Masks. (d) R-O-DisCo. (e) A-O-DisCo. (f) Generated video. (g) Latent of reference video. (h) Latent of the preserved region. (i) Image latent with zero-padding. (m) Noisy latent. (n) Image latent with the latent of the preserved region. $\alpha$ represents the contrast, $\sigma$ represents the intensity of the added noise, and $k$ is the size of the Gaussian blur kernel. The adaptive distorter generates A-O-DisCo for inference, and the random distorter generates R-O-DisCo for training. The CFP ensures the preservation of unedited areas. The IDP maintains object appearance consistency.
  • Figure 4: Our O-DisCo-Edit method is compared against other baselines for addition, color change, light transfer, and object removal. The bottom right of the reference video displays the input masks utilized by all models, while the corresponding position in our results highlights the A-O-DisCo required by our approach.
  • Figure 5: A comparison of O-DisCo-Edit and other baselines on the outpainting task. In the second row, the top right of the reference video displays the input masks utilized by all models, while the same position in our results highlights the A-O-DisCo required by our approach. The blue arrow indicates the source region for the magnified view presented in the first row.
  • ...and 11 more figures
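
To make the three distortion parameters named in the Figure 3 caption concrete, here is a minimal sketch of how a random distorter might combine contrast ($\alpha$), additive noise ($\sigma$), and a Gaussian blur of kernel size $k$ on a single grayscale frame. It is not the authors' implementation. The function names (`gaussian_kernel`, `distort_object`, `sample_params`), the order of operations, the sampling ranges, the blur standard deviation of $k/3$, and the choice to zero out pixels outside the mask are all assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(k):
    """Return a normalized k x k Gaussian kernel (k must be odd)."""
    assert k % 2 == 1, "kernel size must be odd"
    s = k / 3.0  # blur std tied to kernel size: an assumption of this sketch
    r = k // 2
    w = [[math.exp(-(dx * dx + dy * dy) / (2 * s * s))
          for dx in range(-r, r + 1)] for dy in range(-r, r + 1)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def distort_object(frame, mask, alpha, sigma, k, rng):
    """Contrast-scale, add noise, blur, then keep only masked pixels.

    frame: 2D list of floats in [0, 1]; mask: 2D list of 0/1 flags.
    """
    h, w = len(frame), len(frame[0])
    # 1) scale contrast around mid-gray, 2) add Gaussian noise of std sigma
    noisy = [[0.5 + alpha * (frame[y][x] - 0.5) + rng.gauss(0.0, sigma)
              for x in range(w)] for y in range(h)]
    # 3) Gaussian blur with edge clamping, evaluated only inside the mask
    ker, r = gaussian_kernel(k), k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue  # outside the object: left at zero (assumption)
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += ker[dy + r][dx + r] * noisy[yy][xx]
            out[y][x] = min(max(acc, 0.0), 1.0)  # clip to the valid range
    return out


def sample_params(rng):
    """Draw random (alpha, sigma, k); the ranges are illustrative only."""
    return rng.uniform(0.0, 1.0), rng.uniform(0.0, 0.5), rng.choice([1, 3, 5])


# Usage: distort the 2 x 2 object in the middle of a 4 x 4 frame.
rng = random.Random(0)
frame = [[0.2, 0.4, 0.6, 0.8] for _ in range(4)]
mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
alpha, sigma, k = sample_params(rng)
signal = distort_object(frame, mask, alpha, sigma, k, rng)
```

In this sketch, small $\alpha$ with large $\sigma$ and $k$ erases most of the object's appearance, while $\alpha = 1$, $\sigma = 0$, $k = 1$ passes the masked pixels through unchanged. That range is one way to read the "multi-granularity control" described in the TL;DR.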