OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Yongsheng Yu, Ziyun Zeng, Haitian Zheng, Jiebo Luo

TL;DR

OmniPaint addresses the challenge of realistic object editing by unifying object removal and insertion as interdependent tasks. It leverages a pre-trained diffusion prior, a three-phase training pipeline, CycleFlow unpaired refinement, and a no-reference CFD metric to ensure geometric and physical consistency while reducing data requirements. Empirical results show substantial gains over state-of-the-art baselines in both removal and insertion, with CFD providing robust, reference-free evaluation of context coherence and hallucination. The work paves the way for practical, high-fidelity object-oriented editing with limited paired data, and introduces a flexible framework for future extension to more complex scenes and modalities.

Abstract

Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing. Project page: https://yeates.github.io/OmniPaint-Page/

Paper Structure

This paper contains 21 sections, 15 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of OmniPaint for object-oriented editing, including realistic object removal (left) and generative object insertion (right). Masked regions are shown as semi-transparent overlays. In removal cases, the $\times$ marks the target object and its physical effects, such as reflections, with the right column showing the results. In insertion cases, the reference object (inset) is placed into the scene, indicated by a green arrow. Note that for model input, masked regions are fully removed rather than semi-transparent.
  • Figure 2: Visualization of CFD metric assessment for object removal. The segmentation results are obtained using SAM with refinement, with purple masks for background, orange masks for segments fully within the original mask, and unmasked for those extending beyond the original mask. Note that the orange masked regions correspond to hallucinated objects. A higher ReMOVE score is better, while a lower CFD score is preferable. In these cases, ReMOVE scores are too similar to indicate removal success, while the CFD score offers a clearer distinction.
  • Figure 3: Illustration of the proposed CFD metric for evaluating object removal quality. Left: We apply SAM to segment the inpainted image into object masks and classify them into nested ($\Omega_{\mathcal{M}^{n}}$) and overlapping ($\Omega_{\mathcal{M}^{o}}$) masks. Middle: The context coherence term measures the feature deviation between the inpainted region ($\Omega_{\mathbf{M}}$) and its surrounding background ($\Omega_{\mathbf{B} \setminus \mathbf{M}}$) in the DINOv2 feature space. Right: The hallucination penalty is computed by comparing deep features of detected nested objects ($\Omega_{\mathcal{M}^{n}}$) with their adjacent overlapping masks ($\Omega_{\mathcal{M}^{o}}$) to assess whether unwanted object-like structures have emerged.
  • Figure 4: Illustration of CycleFlow. The mapping $F$ removes the object, predicting an estimated target $\mathbf{z}_1'$, while $G$ reinserts the object, generating estimated target $\overline{\mathbf{z}}_1$. Cycle consistency is enforced by ensuring $G$ reconstructs the original latent $\mathbf{z}_1$ from the effect removal output. Dashed arrows indicate the cycle loss supervision.
  • Figure 5: Qualitative comparison on object insertion. Given masked images and reference object images (top row), we compare results from AnyDoor, IMPRINT, and OmniPaint.
  • ...and 2 more figures
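
The captions of Figures 2 and 3 describe sorting the segments of an inpainted image into three groups: background, segments fully within the original mask (nested), and segments extending beyond it (overlapping). The sketch below illustrates only that sorting step on boolean arrays. The function name `classify_segments` and its arguments are hypothetical, the segments are assumed to come from an external segmenter such as SAM, and nothing here reproduces the authors' implementation or the CFD score itself.

```python
import numpy as np


def classify_segments(segments, edit_mask):
    """Sort boolean segment masks by their relation to the edited region.

    Hypothetical helper, not the paper's code. Returns three lists:
    nested (fully inside edit_mask), overlapping (partly inside, partly
    outside), and background (no pixel inside edit_mask).
    """
    edit_mask = np.asarray(edit_mask, dtype=bool)
    nested, overlapping, background = [], [], []
    for seg in segments:
        seg = np.asarray(seg, dtype=bool)
        # Number of segment pixels that fall inside the edited region.
        inside = int(np.logical_and(seg, edit_mask).sum())
        if inside == 0:
            background.append(seg)
        elif inside == int(seg.sum()):
            nested.append(seg)
        else:
            overlapping.append(seg)
    return nested, overlapping, background


# Toy 6x6 example: the edited region covers rows 1-4 and columns 1-4.
edit_mask = np.zeros((6, 6), dtype=bool)
edit_mask[1:5, 1:5] = True

seg_inside = np.zeros((6, 6), dtype=bool)
seg_inside[2:4, 2:4] = True      # lies entirely inside the edited region

seg_crossing = np.zeros((6, 6), dtype=bool)
seg_crossing[3:6, 3:6] = True    # crosses the boundary of the edited region

seg_outside = np.zeros((6, 6), dtype=bool)
seg_outside[0, :] = True         # never touches the edited region

nested, overlapping, background = classify_segments(
    [seg_inside, seg_crossing, seg_outside], edit_mask
)
print(len(nested), len(overlapping), len(background))
```

Under the reading given in the Figure 2 caption, the nested group is the one that corresponds to hallucinated objects, and Figure 3 compares its features against the adjacent overlapping masks.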