LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu

Abstract

Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
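
To make concrete how a full media design relates to its constituent RGBA layers, the following is a minimal sketch of flattening a stack of layers with the standard Porter-Duff "over" operator. It is not taken from the paper and is not LaDe's decoder. It assumes straight (non-premultiplied) alpha, channel values in [0, 1], and layers ordered back to front. The names `over` and `flatten` are hypothetical helpers introduced only for this illustration.

```python
from typing import List, Sequence, Tuple

# Assumption: straight (non-premultiplied) RGBA, each channel in [0, 1].
Pixel = Tuple[float, float, float, float]


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over': place `top` on `bottom` for straight-alpha pixels."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent; colour is undefined, return clear.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        # Weight each colour by its effective coverage, then un-premultiply.
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers: Sequence[List[Pixel]]) -> List[Pixel]:
    """Composite layers given back to front; each layer is a flat pixel list."""
    if not layers:
        return []
    canvas = list(layers[0])
    for layer in layers[1:]:
        if len(layer) != len(canvas):
            raise ValueError("all layers must have the same number of pixels")
        canvas = [over(t, b) for t, b in zip(layer, canvas)]
    return canvas


# Usage: an opaque white background under a half-transparent red layer.
background = [(1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0)]
red_layer = [(1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 0.0, 0.0)]
composite = flatten([background, red_layer])
```

In this example the first pixel becomes a light red and the second, where the top layer is fully transparent, stays white. A decomposition is useful for editing when flattening its layers in this way reproduces the original design.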

Paper Structure (16 sections, 3 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: The unified LaDe model performs (a) Text-to-Layers generation in the RGBA space and (b) Text-to-Image generation from a prompt, as well as (c) Image-to-Layers generation in the RGBA space given an image. LaDe supports variable aspect ratios and a variable number of layers.
  • Figure 2: Text-to-Image Generation with LaDe.
  • Figure 3: Text-to-Layers Generation with LaDe. The gray background is added to emphasize the white content.
  • Figure 4: Image-to-Layers Decomposition with LaDe. The samples are generated by two state-of-the-art proprietary GenAI frameworks called through their APIs.
  • Figure 5: Text-to-Layers generation pipeline (top). Given a short user prompt, LaDe expands it with additional information, which is encoded by FlanT5 XXL (Chung et al., JMLR 2024) and given as additional input to the Diffusion Model. After denoising, the full media design and its RGBA layers are decoded through the RGBA decoder. Text-to-Image generation is obtained by setting the number of layers to 0, so that only the full media design is generated. Image-to-Layers decomposition (bottom) starts from the original media design, which goes through a captioning and layer-splitting operation. The text information, along with the embedding of the input image, is passed through the diffusion model. The rest of the pipeline is the same as in Text-to-Layers generation.
  • ...and 5 more figures