
From Understanding to Erasing: Towards Complete and Stable Video Object Removal

Dingming Liu, Wenjing Wang, Chen Li, Jing Lyu

Abstract

Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.
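To make the second component concrete, below is a minimal sketch of what a framewise context cross-attention block could look like, assuming a PyTorch-style denoiser. All names here (FramewiseContextCrossAttention, the (B, T, N, C) token layout, the per-frame context tokens) are hypothetical illustrations rather than the paper's actual implementation: queries come from each frame's denoising tokens, while keys and values come from that same frame's unmasked background tokens, so every frame is grounded in its own surrounding context.

```python
# Hypothetical sketch of framewise context cross-attention (assumed shapes
# and names; the paper's exact design may differ).
import torch
import torch.nn as nn


class FramewiseContextCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, latents: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, N, C) denoiser tokens for each of T frames
        # context: (B, T, M, C) tokens encoded from the unmasked regions
        #          of the SAME frame (frame-specific, not clip-pooled)
        B, T, N, C = latents.shape
        q = self.norm_q(latents).reshape(B * T, N, C)     # per-frame queries
        kv = self.norm_kv(context).reshape(B * T, -1, C)  # per-frame context keys/values
        out, _ = self.attn(q, kv, kv)                     # frame-local cross-attention
        return latents + out.reshape(B, T, N, C)          # residual injection
```

Injecting context per frame, rather than pooled over the whole clip, is what would let such a block stabilize inpainting frame by frame while the denoiser's temporal layers handle cross-frame consistency.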

Paper Structure

This paper contains 13 sections, 7 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Comparison of different method designs and their object removal results. Green represents the user-provided mask that covers the object to be removed. Details within the red and blue bounding boxes are shown in the rightmost column. Compared with DiffuEraser [li2025diffueraser] and ROSE [miao2025rose], which leave the human's reflection in the mirror and its shadow unremoved, our model removes the human along with its induced effects.
  • Figure 2: Towards side-effect-aware video object removal, we propose an understanding-centric video diffusion framework. Object-Induced Relation Distillation transfers object-side-effect relational knowledge from a vision foundation model to the diffusion denoiser, while Framewise Context Cross-Attention injects frame-specific unmasked background context to stabilize inpainting and improve spatio-temporal consistency.
  • Figure 3: Understanding gap between the vision foundation model (VFM) and the video diffusion model (VDM). The VFM-derived attention map highlights the object region and its induced side effects (e.g., shadow) more clearly, whereas the VDM exhibits weaker and less localized responses (a minimal sketch of the corresponding alignment objective follows this list).
  • Figure 4: Illustration of our Keyframe-Guided Propagation for processing long videos.
  • Figure 5: Samples from our real-world benchmark for video object removal.
  • ...and 7 more figures
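The gap shown in Figure 3 suggests one plausible form for the distillation in Figure 2: align the denoiser's relevance map for the target object with the sharper, side-effect-aware map derived from the vision foundation model. The sketch below uses a hypothetical function name, map shapes, and a simple MSE objective; the paper's actual Object-Induced Relation Distillation loss may differ.

```python
# Assumed attention-alignment distillation loss (illustrative only).
# A frozen VFM yields a relevance map covering the object AND its induced
# effects (e.g., shadows); the VDM's attention map is pushed toward it.
import torch
import torch.nn.functional as F


def relation_distillation_loss(vdm_attn: torch.Tensor,
                               vfm_attn: torch.Tensor) -> torch.Tensor:
    """vdm_attn, vfm_attn: (B, T, H, W) per-frame relevance maps in [0, 1]."""
    # Resize the VFM map to the VDM attention resolution before comparing.
    vfm_attn = F.interpolate(vfm_attn, size=vdm_attn.shape[-2:],
                             mode="bilinear", align_corners=False)
    return F.mse_loss(vdm_attn, vfm_attn)
```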