From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller, Caren Chen, Yiannis Aloimonos
TL;DR
The paper tackles image layer decomposition by reframing it as a combination of inpainting and outpainting. It shows that a pre-trained diffusion-based inpainting model can be efficiently repurposed, via lightweight fine-tuning, to extract the foreground with occlusion recovery and to reconstruct the background with the object removed (Outpaint-and-Remove). It introduces a Multi-Modal Context Fusion module with linear attention complexity to preserve latent detail, a dual image-mask context, a parameter-efficient fine-tuning regime using LoRA, and RGBA foreground decoding. Training relies solely on public data: roughly 100k image–foreground–background triplets assembled from MULAN, LayerDiffuse, and OpenImages, which enables data-efficient learning. Empirical results on MULAN and real-world images demonstrate state-of-the-art performance for image layer decomposition and foreground removal, with substantial reductions in data and compute compared to fully fine-tuned baselines. The work enables flexible, high-quality layer editing, with practical impact for creative tools and downstream editing applications.
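To make the parameter-efficient fine-tuning idea concrete, here is a minimal NumPy sketch of a LoRA-style linear layer: the pre-trained weight stays frozen and only a low-rank update is trained. The class name `LoRALinear`, the rank, the scaling factor, and the initialisation are illustrative assumptions; the summary does not state which layers the authors adapt or which hyperparameters they use.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (LoRA-style sketch)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                             # frozen pre-trained weight, shape (out, in)
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))               # trainable up-projection, zero-initialised
        self.scale = alpha / rank                        # assumed scaling convention

    def __call__(self, x):
        # x has shape (batch, in_dim): base path plus scaled low-rank path.
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        # Only A and B would receive gradients during fine-tuning.
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly. For a 64×64 weight at rank 4, the adapter adds 512 trainable parameters against 4,096 frozen ones.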
Abstract
Images can be viewed as layered compositions of foreground objects over a background, with potential occlusions between them. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight fine-tuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.
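The layered view of an image can be illustrated with the standard "over" compositing operator, which layer decomposition aims to invert: given an RGBA foreground and an RGB background, the composite is alpha · foreground + (1 − alpha) · background. The sketch below assumes straight (non-premultiplied) alpha and float values in [0, 1]; the function name `composite_over` is hypothetical, and the abstract does not specify the authors' own compositing convention.

```python
import numpy as np


def composite_over(foreground_rgba, background_rgb):
    """Composite a straight-alpha RGBA foreground over an RGB background.

    foreground_rgba: array of shape (H, W, 4), values in [0, 1].
    background_rgb:  array of shape (H, W, 3), values in [0, 1].
    """
    fg = np.asarray(foreground_rgba, dtype=np.float64)
    bg = np.asarray(background_rgb, dtype=np.float64)
    alpha = fg[..., 3:4]  # keep a trailing axis so it broadcasts over RGB
    return alpha * fg[..., :3] + (1.0 - alpha) * bg


# A 2x2 example: one opaque red pixel, one half-transparent red pixel,
# and two fully transparent pixels over a uniform dark-grey background.
foreground = np.zeros((2, 2, 4))
foreground[0, 0] = [1.0, 0.0, 0.0, 1.0]
foreground[0, 1] = [1.0, 0.0, 0.0, 0.5]
background = np.full((2, 2, 3), 0.2)
image = composite_over(foreground, background)
```

Where alpha is 0 the background shows through unchanged, and where alpha is 1 the background is fully hidden. That hidden region is the one a decomposition model has to reconstruct when the foreground is removed.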
