VOID: Video Object and Interaction Deletion

Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng

Abstract

Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to account for them and produce implausible results. We present VOID, a video object removal framework designed to perform physically plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
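To make the inference flow concrete, here is a minimal sketch of the two-stage pipeline the abstract describes: a user-selected object mask is expanded by a VLM to cover interaction-affected regions, and the expanded mask guides a video diffusion model's inpainting. Every name below (`vlm.locate_affected_regions`, `diffusion_model.inpaint`, `flow_estimator`, `warp_noise_along_flow`) is a hypothetical placeholder standing in for the paper's components, not the authors' actual API; `warp_noise_along_flow` is sketched after the figure list.

```python
# Hypothetical sketch of VOID-style inference (not the authors' API).
from dataclasses import dataclass
import torch


@dataclass
class RemovalRequest:
    video: torch.Tensor        # (T, 3, H, W) input frames
    object_mask: torch.Tensor  # (T, H, W) bool mask of the clicked object


def expand_mask_with_vlm(vlm, video, object_mask):
    """Ask the VLM which other regions the removed object affects
    (objects it pushes, supports, shadows, ...) and union them into
    the region the diffusion model must rewrite."""
    affected = vlm.locate_affected_regions(video, object_mask)  # (T, H, W) bool
    return object_mask | affected


def remove_object(vlm, diffusion_model, flow_estimator, request, two_pass=True):
    # Pass 1: inpaint the expanded mask; the diffusion model predicts
    # counterfactual trajectories for the affected objects.
    mask = expand_mask_with_vlm(vlm, request.video, request.object_mask)
    draft = diffusion_model.inpaint(request.video, mask)
    if not two_pass:
        return draft

    # Optional pass 2: re-sample with noise warped along the motion
    # predicted in pass 1, which stabilizes object structure (Figure 4).
    flow = flow_estimator(draft)  # (T-1, H, W, 2) backward flow, in pixels
    init = warp_noise_along_flow(flow, channels=4)  # 4 = assumed latent dim
    return diffusion_model.inpaint(request.video, mask, init_noise=init)
```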


Figures (9)

  • Figure 1: Removing an object and its interactions can require rewriting the entire scene. On the left, when the middle three blocks are removed, VOID correctly models the domino effect halting so that the yellow block never falls. On the right, when the hands are removed, VOID correctly models the spinning tops continuing without interruption.
  • Figure 2: Counterfactual supervision examples. Top: videos $\mathbf{V}$ where $O$ is outlined in red. Bottom: re-simulated counterfactuals $\hat{\mathbf{V}}$ generated without $O$. In Kubric scenes, downstream motion changes when the initiating object is removed. In HUMOTO scenes, supported objects transition naturally under gravity.
  • Figure 3: VOID: Interaction-Aware Counterfactual Video Generation. A user provides an input video and clicks on an object to mask it for removal. A VLM-based pipeline expands the mask to identify other areas that will be affected. VOID's first pass then predicts a counterfactual trajectory. The optional second pass suppresses object deformation using flow-warped noise derived from the initially predicted motion.
  • Figure 4: Frames from generated videos featuring a guitar entering free-fall and a thrown ball following a new trajectory. Pass 1 (left) produces correct counterfactual trajectories but exhibits structural deformation. Pass 2 (right) better preserves object rigidity by using motion-aligned warped noise (one variant is sketched after this list).
  • Figure 5: Qualitative comparisons on real-world videos. VOID maintains object structure and produces plausible motion over time, while the baselines exhibit deformation (the kettlebell on the pillow, and the floaty deforming), incomplete removal (two cars crashing), or implausible outputs (paint appearing after the roller is removed).
  • ...and 4 more figures
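The "flow-warped noise" behind the second pass (Figures 3 and 4) can be realized in several ways; below is one common bilinear variant, written against PyTorch, and a sketch of the general technique rather than the paper's exact scheme. Each frame's initial Gaussian noise is pulled backward along the optical flow predicted in pass 1, so the same noise pattern tracks each object and the sampler is encouraged to keep object structure rigid.

```python
# One way to build flow-warped initial noise (a sketch, not the paper's
# exact scheme): frame t's noise is frame t-1's noise advected along the
# predicted optical flow, so noise patterns follow moving objects.
import torch
import torch.nn.functional as F


def warp_noise_along_flow(flow: torch.Tensor, channels: int = 4) -> torch.Tensor:
    """flow: (T-1, H, W, 2) backward flow in pixels (frame t -> frame t-1).
    Returns (T, channels, H, W) noise whose pattern tracks the motion."""
    t_minus_1, h, w, _ = flow.shape
    noise = torch.empty(t_minus_1 + 1, channels, h, w)
    noise[0].normal_()  # fresh Gaussian noise for the first frame

    # Pixel-coordinate grid, reused for every frame.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    base = torch.stack([xs, ys], dim=-1)  # (H, W, 2) as (x, y)

    for t in range(1, t_minus_1 + 1):
        # Where each pixel of frame t came from in frame t-1,
        # converted to grid_sample's normalized [-1, 1] coordinates.
        src = base + flow[t - 1]
        grid = torch.stack(
            [2 * src[..., 0] / (w - 1) - 1, 2 * src[..., 1] / (h - 1) - 1],
            dim=-1,
        )
        warped = F.grid_sample(
            noise[t - 1 : t], grid.unsqueeze(0),
            mode="bilinear", padding_mode="border", align_corners=True,
        )[0]
        # Bilinear interpolation shrinks the noise variance; rescale back
        # toward unit variance so the diffusion prior still roughly holds.
        noise[t] = warped / warped.std().clamp_min(1e-6)
    return noise
```

The per-frame rescaling here is a crude correction: bilinear warping smooths Gaussian noise, and more careful implementations re-normalize locally or use interpolation schemes designed to preserve the noise distribution. The returned tensor would then be handed to the sampler as its initial latent noise for the second inpainting pass.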