Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency

Yikai Wang, Chenjie Cao, Junqiu Yu, Ke Fan, Xiangyang Xue, Yanwei Fu

TL;DR

This work targets two persistent problems in latent inpainting with diffusion/rectified-flow models: unwanted object insertion and color inconsistency in masked regions. It introduces ASUKA, a post-training framework that uses a Masked Auto-Encoder (MAE) prior to provide a context-stable guide for frozen generators and a dedicated decoder to achieve local color harmonization during decoding. An alignment module bridges MAE priors to the frozen latent models, while training employs mask-simulated MAE data and augmentations to align distributions and dimensions without retraining the backbone. Evaluations on SD v1.5 inpainting and FLUX across Places2 and MISATO show ASUKA reduces object hallucination, improves color consistency, and achieves state-of-the-art or competitive perceptual and fidelity metrics, demonstrating a practical, plug-in improvement for high-fidelity inpainting. The approach offers a scalable pathway to higher-quality inpainted imagery without the cost of re-training large latent generative models.

Abstract

Recent advances in image inpainting increasingly use generative models to handle large irregular masks. However, these models can create unrealistic inpainted images due to two main issues: (1) Unwanted object insertion: Even with unmasked areas as context, generative models may still generate arbitrary objects in the masked region that don't align with the rest of the image. (2) Color inconsistency: Inpainted regions often have color shifts that cause a smeared appearance, reducing image quality. Retraining the generative model could help solve these issues, but it's costly since state-of-the-art latent-based diffusion and rectified flow models require a three-stage training process: training a VAE, training a generative U-Net or transformer, and fine-tuning for inpainting. Instead, this paper proposes a post-processing approach, dubbed ASUKA (Aligned Stable inpainting with UnKnown Areas prior), to improve inpainting models. To address unwanted object insertion, we leverage a Masked Auto-Encoder (MAE) for reconstruction-based priors. This mitigates object hallucination while maintaining the model's generation capabilities. To address color inconsistency, we propose a specialized VAE decoder that treats latent-to-image decoding as a local harmonization task, significantly reducing color shifts for color-consistent inpainting. We validate ASUKA on SD 1.5 and FLUX inpainting variants with Places2 and MISATO, our proposed diverse collection of datasets. Results show that ASUKA mitigates object hallucination and improves color consistency over standard diffusion and rectified flow models and other inpainting methods.
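To make the color-inconsistency issue concrete, here is a minimal NumPy sketch. It is not ASUKA's decoder, and the function names `unmasked_color_shift` and `paste_back` are hypothetical helpers introduced only for illustration. It assumes the decoded image carries a uniform color shift, and shows how naively pasting the known pixels back over such an image leaves a visible step between the hole and its surroundings.

```python
import numpy as np


def unmasked_color_shift(original, decoded, mask):
    """Mean per-channel difference (decoded - original) over known pixels.

    original, decoded: float arrays of shape (H, W, 3) in [0, 1].
    mask: bool array of shape (H, W); True marks the region to inpaint.
    """
    known = ~mask
    return (decoded[known] - original[known]).mean(axis=0)


def paste_back(original, decoded, mask):
    """Naive compositing: keep known pixels, use generated pixels in the hole."""
    return np.where(mask[..., None], decoded, original)


# Toy data (an assumption for illustration): a flat gray image whose
# "decoded" version is uniformly darker, simulating a decoding color shift.
original = np.full((4, 4, 3), 0.5)
decoded = original - 0.1
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # 2x2 hole in the center

shift = unmasked_color_shift(original, decoded, mask)
composite = paste_back(original, decoded, mask)

# Brightness gap between known pixels and the hole after paste-back.
seam = composite[~mask].mean() - composite[mask].mean()
```

In this toy setting the shift measured on the known pixels equals the brightness gap that paste-back leaves at the mask boundary. Per the abstract, ASUKA handles this during decoding by treating latent-to-image decoding as a local harmonization task, rather than by measuring and correcting the shift afterwards.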

Paper Structure (59 sections, 1 equation, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Image inpainting on $1024^2$ images. ASUKA addresses two issues in current diffusion and rectified flow inpainting models: (1) Unwanted object insertion, where random elements that are not aligned with the unmasked region are generated; (2) Color inconsistency, where a color shift in the generated masked region causes smear-like traces. ASUKA proposes a post-training procedure for these models that significantly mitigates object hallucination and improves the color consistency of inpainted results.
  • Figure 2: ASUKA tackles the unwanted object insertion issue by adopting the MAE to provide a stable prior for frozen latent generative models, maintaining generation capacity while mitigating object hallucination. For the color-inconsistency issue, ASUKA utilizes an inpainting-specialized decoder to achieve color consistency between masked and unmasked regions when decoding latents.
  • Figure 3: Using the MAE prior for image-to-image translation via SD (starting from an 80% noise rate) yields poor inpainting results.
  • Figure 4: The color shift appears across all kinds of scenarios in inpainted images, including indoor and outdoor scenes and random or continuous masks, and may make the inpainted region darker or lighter.
  • Figure 5: Inpainting with vs. without latent augmentation. The latent augmentation handles the gap between generated and real latents.
  • ...and 6 more figures