
Referring Layer Decomposition

Fangyi Chen, Yaojie Shen, Lu Xu, Ye Yuan, Shu Zhang, Yulei Niu, Longyin Wen

TL;DR

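The paper introduces the Referring Layer Decomposition (RLD) task together with RefLade, a dataset and evaluation protocol that establish RLD as a well-defined, benchmarkable research task, and presents RefLayer, a simple baseline for prompt-conditioned layer decomposition that achieves high visual fidelity and semantic alignment.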

Abstract

Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet most existing approaches operate on entire images holistically, which limits their ability to isolate and manipulate individual scene elements. In contrast, layered representations, in which scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At the core of this effort is RefLade, a large-scale dataset comprising 1.11M image-layer-prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline for prompt-conditioned layer decomposition that achieves high visual fidelity and semantic alignment. Extensive experiments show that our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization.
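
To make the notion of an RGBA layer concrete, the following minimal sketch shows how a stack of such layers could be flattened back into a single image with the standard Porter-Duff "over" operator. The abstract does not state which compositing model the authors use, so straight (non-premultiplied) alpha, the back-to-front ordering, and the names `over` and `composite` are all assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float, float]  # (R, G, B, A), each in [0, 1]
Layer = List[List[Pixel]]                  # rows of pixels


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over' for straight alpha (assumed, not from the paper)."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent; colour is undefined, so use zeros.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(layers: List[Layer]) -> Layer:
    """Flatten layers ordered back-to-front into one RGBA image."""
    height, width = len(layers[0]), len(layers[0][0])
    canvas: Layer = [[(0.0, 0.0, 0.0, 0.0)] * width for _ in range(height)]
    for layer in layers:
        canvas = [
            [over(layer[y][x], canvas[y][x]) for x in range(width)]
            for y in range(height)
        ]
    return canvas


# Hypothetical 1x1 example: an opaque blue background under a
# half-transparent red foreground layer.
background: Layer = [[(0.0, 0.0, 1.0, 1.0)]]
foreground: Layer = [[(1.0, 0.0, 0.0, 0.5)]]
flattened = composite([background, foreground])
```

Under this view, decomposition is the inverse problem: given only the flattened RGB image and a prompt, recover a complete layer, including the parts hidden by whatever sits in front of it.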

Paper Structure (47 sections, 4 equations, 13 figures, 7 tables)


Figures (13)

  • Figure 1: The RefLayer model trained on RefLade demonstrating the RLD task. Each row presents two different prompts and their corresponding layer outputs for the same input image. Given an image and diverse user prompts, RLD requires the model to generate targeted and complete RGBA layers. The figure showcases examples across prompting modes: Spatial Prompting (e.g., points, boxes, masks) and Linguistic Prompting (e.g., text descriptions like “the brown and white horse” or “background”). Coarse prompts such as a single point may lead to coarse-grained outputs (e.g., a combination of a walker and a dog), while more precise prompts yield accurate, object-specific layers, highlighting multi-granularity capabilities of RefLayer and its strong controllability and generalization.
  • Figure 2: Overview of the data engine. The pipeline decomposes a natural image into multiple prompt-aligned RGBA layers through six automatic stages: pre-filtering, scene understanding, layer completion, post-completion, prompt generation, and post-filtering.
  • Figure 3: Analysis of RefLade dataset. (a) Instance distribution: Most training images contain 1–3 instances, covering a wide range of object categories. (b) Size comparison: RefLade includes significantly more small instances (by area ratio) than MuLAn.
  • Figure 4: Comparison of model evaluation metrics. (a) The HPA score shows strong alignment with Human ELO rankings. (b) In contrast, none of the individual metrics ($S_{\text{vis}}$, $S_{\text{fid}}$, $S_{\text{gen}}$) consistently aligns with human preferences across models. Models A-I are anonymized for the ELO ranking.
  • Figure 5: RefLayer Model architecture. The model supports prompt-conditioned layer generation using spatial (box, point, mask) and/or textual inputs.
  • ...and 8 more figures
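
The Figure 4 caption compares the HPA score against human ELO rankings. As a reminder of how pairwise human preferences are typically turned into such a ranking, here is a minimal sketch of the standard Elo update. The captions do not describe the authors' exact procedure, so the K-factor of 32, the starting rating of 1000, and the function names are assumptions.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from the paper.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two hypothetical models start at 1000; a human rater prefers model A.
model_a, model_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Repeating this update over many human judgments yields a ranking against which an automatic metric can be checked for agreement.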