OmniEraser: Remove Objects and Their Effects in Images with Paired Video-Frame Data

Runpu Wei, Zijin Yin, Shuo Zhang, Lanxiang Zhou, Xueyi Wang, Chao Ban, Tianwei Cao, Hao Sun, Zhongjiang He, Kongming Liang, Zhanyu Ma

TL;DR

OmniEraser tackles the removal of objects together with their visual effects by leveraging a large-scale, real-world paired dataset (Video4Removal) and a novel Object-Background Guidance strategy that conditions diffusion-based restoration on both the target object and its background. The method integrates LoRA adapters with a generative prior to adapt to the removal task, enabling robust suppression of shadows and reflections while preserving surrounding content. Key contributions include the Video4Removal dataset (134,281 triplets at 1920×1080), the Object-Background Guidance framework, and the RemovalBench benchmark for evaluation; OmniEraser achieves state-of-the-art results on in-the-wild scenes and generalizes to anime styles. The approach reduces the labor of data collection and improves the realism of results in image editing and post-processing.

Abstract

Inpainting algorithms have achieved remarkable progress in removing objects from images, yet they still face two challenges: 1) they struggle to handle the object's visual effects, such as shadows and reflections; 2) they easily generate shape-like artifacts and unintended content. In this paper, we propose Video4Removal, a large-scale dataset comprising over 100,000 high-quality samples with realistic object shadows and reflections. By constructing object-background pairs from video frames with off-the-shelf vision models, the labor costs of data acquisition can be significantly reduced. To avoid generating shape-like artifacts and unintended content, we propose Object-Background Guidance, an elaborate paradigm that takes both the foreground object and background images as input. It guides the diffusion process to harness richer contextual information. Based on these two designs, we present OmniEraser, a novel method that seamlessly removes objects and their visual effects using only object masks as input. Extensive experiments show that OmniEraser significantly outperforms previous methods, particularly in complex in-the-wild scenes. It also exhibits strong generalization ability on anime-style images. Datasets, models, and code will be published.

Paper Structure

This paper contains 10 sections, 2 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: OmniEraser vs. Other Methods. State-of-the-art methods, PowerPaint [zhuang2025powerpaint], CLIPAway [Ekin2024CLIPAwayHF], and Attentive-Eraser [sun2024attentive], tend to generate unintended objects and struggle to remove the target object's effects, leading to unrealistic outputs. In contrast, our OmniEraser seamlessly removes target objects along with their shadows and reflections, using only object masks as input.
  • Figure 2: Illustration of Video4Removal. We capture two typical visual effects: shadows and reflections. Our input masks exclude the effect regions, encouraging models to learn the associations between objects and their effects in an end-to-end manner.
  • Figure 3: The construction pipeline of Video4Removal. We first separate all video frames into two categories: background frames and foreground frames containing moving objects. Then, each foreground frame is paired with the temporally closest background frame to form a pair. Finally, object masks for the foreground frames are obtained using off-the-shelf segmentation models. This process results in high-quality, photorealistic triplets suitable for object removal tasks.
  • Figure 4: Architecture of OmniEraser. We propose Object-Background Guidance, which uses latents from the object and the background as joint input conditions to guide the denoising process.
  • Figure 5: Examples from RemovalBench. Each group of images shows the original image, ground truth, and object mask.
  • ...and 5 more figures
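
The pairing step described in the Figure 3 caption (each foreground frame is matched to the temporally closest background frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Frame` class, the `has_object` flag, and the tie-breaking rule (prefer the earlier background frame) are assumptions made for illustration. The foreground/background classification and the object masks, which the paper obtains with off-the-shelf vision models, are not modeled here.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    index: int        # position of the frame in the video
    has_object: bool  # hypothetical flag: True if a moving object is present


def pair_with_closest_background(frames):
    """Pair every foreground frame with the temporally closest background frame.

    Returns a list of (foreground_index, background_index) tuples, ordered by
    foreground index. Ties go to the earlier background frame (an assumption).
    """
    background = sorted(f.index for f in frames if not f.has_object)
    if not background:
        return []  # nothing to pair against

    pairs = []
    for frame in sorted(frames, key=lambda fr: fr.index):
        if not frame.has_object:
            continue
        # Only the background frames just before and just after can be closest.
        pos = bisect_left(background, frame.index)
        candidates = background[max(pos - 1, 0):pos + 1]
        closest = min(candidates, key=lambda b: (abs(b - frame.index), b))
        pairs.append((frame.index, closest))
    return pairs


if __name__ == "__main__":
    # Frames 2-4 contain a moving object; the rest are clean background.
    video = [Frame(i, i in {2, 3, 4}) for i in range(7)]
    print(pair_with_closest_background(video))
```

Each resulting pair, combined with a mask for the foreground frame, would correspond to one (foreground, background, mask) triplet of the kind the caption describes.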