MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation

Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, Sarah Parisot

TL;DR

MuLAn tackles the challenge of controllable text-to-image generation by providing a large, publicly available dataset of multi-layer RGBA decompositions for real-world images. It introduces a training-free pipeline that extracts per-instance RGBA layers from monocular RGB images using pretrained detectors, segmentation, depth, and diffusion-based inpainting, then reassembles these into an RGBA stack. Built from COCO and LAION Aesthetic 6.5, MuLAn enables layer-wise generation and editing, demonstrated through RGBA generation and instance addition tasks, with robust filtering to ensure high-quality decompositions. This resource and its modular pipeline pave the way for layer-aware generation and editing methods, potentially improving local fidelity and controllability in diffusion-based synthesis and modification workflows.

Abstract

Text-to-image generation has achieved astonishing results, yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering, scene layout conditioning, or image editing techniques which often require hand-drawn masks. Nonetheless, pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards addressing this challenge, we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multilayer, instance-wise RGBA decompositions, and over 100K instance images. To build MuLAn, we developed a training-free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising background and isolated instances. We achieve this through the use of pretrained general-purpose models, and by developing three modules: image decomposition for instance discovery and extraction, instance completion to reconstruct occluded areas, and image re-assembly. We use our pipeline to create the MuLAn-COCO and MuLAn-LAION datasets, which contain a variety of image decompositions in terms of style, composition and complexity. With MuLAn, we provide the first photorealistic resource providing instance decomposition and occlusion information for high-quality images, opening up new avenues for text-to-image generative AI research. With this, we aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions. MuLAn data resources are available at https://MuLAn-dataset.github.io/.
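
To make the re-assembly idea concrete, the sketch below flattens a stack of RGBA layers back into a single image with the standard Porter-Duff "over" operator. It is an illustration only, not the authors' implementation: the names `over` and `flatten_stack`, the straight (non-premultiplied) alpha convention, the [0, 1] value range, and the ordering with the background as the first layer are all assumptions made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float, float]  # (R, G, B, A), each in [0, 1] (assumed)
Layer = List[List[Pixel]]                  # rows of pixels


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over' for straight (non-premultiplied) alpha."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels fully transparent: colour is undefined, return clear.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten_stack(layers: List[Layer]) -> Layer:
    """Composite layers in order; layers[0] is assumed to be the background."""
    result = [row[:] for row in layers[0]]
    for layer in layers[1:]:
        result = [
            [over(t, b) for t, b in zip(top_row, bottom_row)]
            for top_row, bottom_row in zip(layer, result)
        ]
    return result


# Hypothetical 1x2 scene: an opaque blue background and one instance layer
# that covers only the left pixel with opaque red.
background: Layer = [[(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)]]
instance: Layer = [[(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 0.0)]]
flat = flatten_stack([background, instance])
```

Passing a prefix of the layer list (for example `flatten_stack(layers[:k])`) reproduces the kind of iterative instance addition described in the Figure 1 caption, and dropping a layer from the list removes that instance while exposing whatever the layers beneath it contain.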

Paper Structure (42 sections, 30 figures, 4 tables)

Figures (30)

  • Figure 1: Example annotations from our MuLAn dataset. We decompose an image into a multi-layer RGBA stack, where each layer comprises an instance image with transparent alpha layer (green overlays) and background image. For each scene, the second row shows iterative addition of RGBA instance layers.
  • Figure 2: Illustration of our RGBA decomposition objective.
  • Figure 3: Illustration of the inpainting procedure for a given instance.
  • Figure 4: Overview of our RGBA decomposition pipeline.
  • Figure 5: Failure distribution on manually annotated data subset.
  • ...and 25 more figures