From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition

Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller, Caren Chen, Yiannis Aloimonos

TL;DR

The paper tackles image layer decomposition by reframing it as a combination of inpainting and outpainting. It shows that a pre-trained diffusion-based inpainting model can be repurposed via lightweight fine-tuning to extract the foreground with occlusion recovery and to reconstruct the background with the object removed (Outpaint-and-Remove). It introduces a Multi-Modal Context Fusion module with linear attention complexity to preserve latent detail, a dual image-mask context, and a parameter-efficient fine-tuning regime using LoRA, along with RGBA foreground decoding. Training relies solely on public data: roughly 100k image–foreground–background triplets assembled from MULAN, LayerDiffuse, and OpenImages. Empirical results on MULAN and real-world images demonstrate state-of-the-art performance for image layer decomposition and foreground removal, with substantial reductions in data and compute compared to fully fine-tuned baselines. The work enables flexible, high-quality layer editing for creative tools and downstream editing applications.
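To make the parameter-efficient fine-tuning idea concrete, here is a minimal NumPy sketch of a generic LoRA layer: the pre-trained weight stays frozen and only a low-rank update is trained. This is not the paper's implementation. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the layer size are all illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update (generic LoRA sketch)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                           # frozen pre-trained weight (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, (rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))               # trainable up-projection, zero-initialised
        self.scale = alpha / rank                      # assumed scaling convention

    def __call__(self, x):
        # x has shape (batch, d_in): base path plus scaled low-rank path.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        # Only A and B would receive gradient updates during fine-tuning.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))   # hypothetical pre-trained weight
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 64))
y = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and the number of trainable values (`A` plus `B`) is a small fraction of the full weight matrix.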

Abstract

Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.
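The layered view of an image can be illustrated with the standard alpha "over" operator, which recombines an RGBA foreground with a background. The sketch below assumes straight (non-premultiplied) alpha and float values in [0, 1]. The function name `composite` and the toy 2x2 arrays are illustrative and do not come from the paper.

```python
import numpy as np


def composite(foreground_rgba, background_rgb):
    """Alpha-composite an RGBA foreground over an RGB background."""
    rgb = foreground_rgba[..., :3]
    alpha = foreground_rgba[..., 3:4]   # keep the last axis so it broadcasts over RGB
    return alpha * rgb + (1.0 - alpha) * background_rgb


# Toy 2x2 example: uniform grey background, two foreground pixels.
background = np.full((2, 2, 3), 0.2)
foreground = np.zeros((2, 2, 4))        # fully transparent by default
foreground[0, 0] = [1.0, 0.0, 0.0, 1.0]  # opaque red
foreground[0, 1] = [0.0, 1.0, 0.0, 0.5]  # half-transparent green
image = composite(foreground, background)
```

Layer decomposition is the inverse problem: given only `image`, recover a plausible `foreground` (including occluded parts) and a complete `background`.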

Paper Structure

This paper contains 28 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: The top illustrates the original inpainting functionality of the pre-trained Inpainting DiT. The bottom shows our adapted pipeline for the image layer decomposition task. We introduce three key components to the pre-trained model for the adaptation: 1) Multi-Modal context Tokenization, 2) Parameter-Efficient Fine-Tuning (PEFT), 3) RGBA Decoding. Original components from the pre-trained model are highlighted in light blue, while our added or modified components are shown in orange.
  • Figure 2: Detailed diagram of the key components in our proposed adaptation method. Light blue boxes denote components from the original pre-trained inpainting DiT model, while orange boxes represent our modifications or additions. Our approach efficiently incorporates both Image-Mask Context and Multi-Modal Context tokens to guide generation. After adaptation, the model can simultaneously output an extracted and outpainted foreground along with a clean, object-removed background.
  • Figure 3: We illustrate the difference between the standard image-mask context used in diffusion-based inpainting models, shown in the light blue box as $c^{b}_{I-M}$, and our proposed image-mask context for image layer decomposition, shown in the orange box as $\{c^{f}_{I-M}, c^{b}_{I-M}\}$.
  • Figure 4: Our training data consists of three sources: backgrounds, real foregrounds, and synthetic foregrounds.
  • Figure 5: We present examples comparing our method against baselines on our collected real-world image test set for the object removal task. These qualitative results highlight the visual differences in foreground removal accuracy, background reconstruction quality, and consistency across various challenging scenes. Please zoom in for the best viewing quality.
  • ...and 8 more figures