MiniMax-Remover: Taming Bad Noise Helps Video Object Removal

Bojia Zi, Weixuan Peng, Xianbiao Qi, Jianan Wang, Shihao Zhao, Rong Xiao, Kam-Fai Wong

TL;DR

MiniMax-Remover tackles video object removal with a two-stage diffusion framework that minimizes reliance on classifier-free guidance and accelerates inference. Stage 1 builds a lightweight DiT-based remover using contrastive conditioning and cross-attention removal, while Stage 2 applies a human-guided minimax distillation to harden the model against adversarial noise, enabling CFG-free, fast inference with as few as 6 sampling steps. Extensive experiments on DAVIS and Pexels show state-of-the-art removal quality, strong temporal consistency, and favorable GPT-O3-based evaluations. The approach offers practical impact for real-time or resource-constrained video editing by delivering high-quality removals with reduced computational cost and artifacts.

Abstract

Recent advances in video diffusion models have driven rapid progress in video editing techniques. However, video object removal, a critical subtask of video editing, remains challenging due to issues such as hallucinated objects and visual artifacts. Furthermore, existing methods often rely on computationally expensive sampling procedures and classifier-free guidance (CFG), resulting in slow inference. To address these limitations, we propose MiniMax-Remover, a novel two-stage video object removal approach. Motivated by the observation that text conditioning is not well suited to this task, in the first stage we simplify the pretrained video generation model by removing the textual input and cross-attention layers, resulting in a more lightweight and efficient model architecture. In the second stage, we distill our remover on successful videos produced by the stage-1 model and curated by human annotators, using a minimax optimization strategy to further improve editing quality and inference speed. Specifically, the inner maximization identifies adversarial input noise ("bad noise") that causes removal failures, while the outer minimization trains the model to generate high-quality removal results even under such challenging conditions. As a result, our method achieves state-of-the-art video object removal results with as few as 6 sampling steps and does not rely on CFG, significantly improving inference efficiency. Extensive experiments demonstrate the effectiveness and superiority of MiniMax-Remover compared to existing methods. Code and videos are available at: https://minimax-remover.github.io.
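
The minimax idea in the abstract (an inner search for "bad noise", an outer update that trains against it) can be illustrated with a toy sketch. This is not the paper's objective or model: the per-dimension linear "remover", the squared-error loss, the norm-ball constraint on the noise, and every name and hyperparameter below (`find_bad_noise`, `radius`, `step_size`, `lr`) are assumptions chosen only to keep the example small and runnable in plain Python.

```python
import math
import random


def loss_and_grads(w, b, eps, target):
    """Squared error of a toy per-dimension linear model and its gradients.

    The model output is w[i] * eps[i] + b[i]; this stands in for a real
    diffusion remover purely for illustration.
    """
    residual = [wi * ei + bi - ti for wi, bi, ei, ti in zip(w, b, eps, target)]
    loss = sum(r * r for r in residual)
    grad_w = [2.0 * r * e for r, e in zip(residual, eps)]
    grad_b = [2.0 * r for r in residual]
    grad_eps = [2.0 * r * wi for r, wi in zip(residual, w)]
    return loss, grad_w, grad_b, grad_eps


def find_bad_noise(w, b, eps0, target, radius=0.5, steps=5, step_size=0.2):
    """Inner maximization: projected gradient ascent on the input noise.

    The search stays within an L2 ball of the given radius around the sampled
    noise eps0 and returns the highest-loss candidate seen (eps0 included).
    """
    eps = list(eps0)
    best_eps = list(eps0)
    best_loss = loss_and_grads(w, b, eps0, target)[0]
    for _ in range(steps):
        grad_eps = loss_and_grads(w, b, eps, target)[3]
        g_norm = math.sqrt(sum(g * g for g in grad_eps))
        if g_norm == 0.0:
            break
        # Normalized ascent step on the noise.
        eps = [e + step_size * g / g_norm for e, g in zip(eps, grad_eps)]
        # Project back onto the ball around eps0 if the step left it.
        delta = [e - e0 for e, e0 in zip(eps, eps0)]
        d_norm = math.sqrt(sum(d * d for d in delta))
        if d_norm > radius:
            eps = [e0 + d * radius / d_norm for e0, d in zip(eps0, delta)]
        current = loss_and_grads(w, b, eps, target)[0]
        if current > best_loss:
            best_loss, best_eps = current, list(eps)
    return best_eps


def train(num_iters=300, dim=4, lr=0.02, seed=0):
    """Outer minimization: gradient descent on the loss at the bad noise."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = [0.0] * dim
    target = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    history = []
    for _ in range(num_iters):
        eps0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        bad_eps = find_bad_noise(w, b, eps0, target)
        loss, grad_w, grad_b, _ = loss_and_grads(w, b, bad_eps, target)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b = [bi - lr * g for bi, g in zip(b, grad_b)]
        history.append(loss)
    return w, b, target, history
```

Calling `train()` returns the fitted parameters together with the per-iteration loss measured at the adversarial noise. In this toy setting the loss shrinks as the model learns to ignore its noise input, which mirrors, in a very loose sense, the goal of producing good results even from unfavorable noise.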

Paper Structure

This paper contains 29 sections, 17 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Visual Results of MiniMax-Remover. The left side displays the original videos, while the right side shows the edited results. Our method achieves high-quality removal of the target objects: the girl, chameleon, bird, lane line, and red wine glass, as illustrated in the five corresponding video examples. Best viewed with Acrobat Reader. Click the images to play the animations.
  • Figure 2: Comparison of different blocks: (a) the original Wan2.1 DiT block; (b) the DiT block with contrastive tokens (positive or negative token); (c) the block after removing CFG.
  • Figure 3: The pipeline of our two-stage method.
  • Figure 4: Visual results of our object remover. The video on the left is the original, while the video on the right is the edited result. Best viewed with Acrobat Reader. Click the images to play the animation clips.
  • Figure 5: Training framework of Stage 1. (a) denotes the positive-condition process, in which the positive condition learns to remove the masked objects; (b) denotes the negative-condition process, in which the negative condition learns to generate the masked objects.
  • ...and 3 more figures