From Ideal to Real: Stable Video Object Removal under Imperfect Conditions

Jiagao Hu, Yuxuan Chen, Fuhao Li, Zepeng Wang, Fei Wang, Daiguo Zhou, Jian Luan

TL;DR

Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.

Abstract

Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch, equipped with Denoising-Aware AdaLN and trained with mask degradation, which provides an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training, in which Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows and reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
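
The windowed union behind MUSE can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the boolean `(T, H, W)` mask layout, the window size of 4, and the strided baseline used for contrast are all assumptions made for illustration. It only shows why taking the union over each temporal window keeps every location the object visits, whereas sampling one frame per window can miss a fast-moving target.

```python
import numpy as np

def downsample_mask_stride(mask: np.ndarray, window: int) -> np.ndarray:
    """Hypothetical baseline: keep only the first frame of each window."""
    return mask[::window]

def downsample_mask_union(mask: np.ndarray, window: int) -> np.ndarray:
    """Windowed union: a pixel stays masked if it is masked in ANY frame
    of its temporal window. `mask` is a boolean array of shape (T, H, W)."""
    t = mask.shape[0]
    pad = (-t) % window  # frames needed to complete the last window
    if pad:
        # Empty (all-False) padding frames do not change the union.
        padding = np.zeros((pad,) + mask.shape[1:], dtype=mask.dtype)
        mask = np.concatenate([mask, padding], axis=0)
    windows = mask.reshape(-1, window, *mask.shape[1:])
    return windows.any(axis=1)

# Toy clip: a one-pixel object moves one column to the right per frame.
mask = np.zeros((4, 1, 4), dtype=bool)
for frame in range(4):
    mask[frame, 0, frame] = True

strided = downsample_mask_stride(mask, window=4)  # sees only column 0
union = downsample_mask_union(mask, window=4)     # covers all four columns
```

In this toy clip the strided mask marks a single pixel, while the union mask covers every position the object occupied within the window, which is the behavior the abstract attributes to MUSE under abrupt motion.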

Paper Structure (54 sections, 7 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Results of our Stable Video Object Removal (SVOR) compared with MiniMax-Remover [zi2025minimax] and ROSE [miao2025rose] on three common real-world challenges. The proposed SVOR achieves stable and artifact-free removal.
  • Figure 2: The framework of SVOR. Stage I: pretrain on unpaired real-world background videos using a Random Mask Strategy to simulate object motion. Stage II: refine on paired synthetic data with Mask Degradation to mimic imperfect masks, where DA-Seg complements defective guidance. MUSE performs windowed union retention during temporal mask downsampling, preventing loss of dynamic location information.
  • Figure 3: Qualitative comparison between our SVOR and several state-of-the-art methods on real-world and synthetic samples. Previous methods suffer from issues such as Undesired object, Artifacts, Blur, Undesired remove, Unremoved shadow, and Unremoved effects. Our SVOR achieves consistently cleaner removal, fewer artifacts, and better shadow handling.
  • Figure 4: Effect of MUSE under abrupt-motion frames. MUSE improves removal even without additional training. "T"/"I" denote Training/Inference, and "$\times$"/"$\checkmark$" indicate without/with MUSE.
  • Figure 5: Robust removal under SAM2 failures. Existing methods miss unsegmented objects when SAM2 drops them, while our SVOR still achieves temporally consistent removal.
  • ...and 8 more figures