Cycle-Consistent Tuning for Layered Image Decomposition

Zheng Gu, Min Lu, Zhida Sun, Dani Lischinski, Daniel Cohen-Or, Hui Huang

TL;DR

This work presents an in-context image decomposition framework that leverages large diffusion foundation models for layered separation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images.

Abstract

Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
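The cycle-consistency idea can be illustrated with a small sketch. This is not the authors' training code: the names `cycle_losses`, `toy_compose`, and `toy_decompose` are hypothetical, images are flattened lists of floats, and mean squared error is an assumed stand-in for whatever objective the paper actually uses. It only shows the two reconstruction directions that bidirectional supervision can penalize.

```python
from typing import Callable, List, Tuple

# Toy "image": a flat list of floats. A real system would use latents or tensors.
Image = List[float]


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two equally sized toy images."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cycle_losses(
    decompose: Callable[[Image], Tuple[Image, Image]],
    compose: Callable[[Image, Image], Image],
    composite: Image,
    logo: Image,
    obj: Image,
) -> Tuple[float, float]:
    """Return (forward, backward) cycle losses for hypothetical models.

    forward:  composite -> (logo, object) -> composite
    backward: (logo, object) -> composite -> (logo, object)
    """
    pred_logo, pred_obj = decompose(composite)
    forward = mse(compose(pred_logo, pred_obj), composite)

    rec_logo, rec_obj = decompose(compose(logo, obj))
    backward = 0.5 * (mse(rec_logo, logo) + mse(rec_obj, obj))
    return forward, backward


# Hypothetical stand-ins, NOT the paper's diffusion models.
def toy_compose(logo: Image, obj: Image) -> Image:
    # Additive blend; real logo-object interactions are non-linear.
    return [l + o for l, o in zip(logo, obj)]


def toy_decompose(img: Image) -> Tuple[Image, Image]:
    # Naive split that assigns half of every pixel to each layer.
    half = [0.5 * v for v in img]
    return half, half


forward, backward = cycle_losses(
    toy_decompose, toy_compose,
    composite=[1.0, 2.0], logo=[0.2, 0.4], obj=[0.8, 1.6],
)
```

In this toy setup the naive split reconstructs the composite exactly, so the forward loss is zero, yet it fails to recover the true layers, so the backward loss is positive. That gap is one way to see why supervising both directions may constrain the decomposition more than either direction alone.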

Paper Structure

This paper contains 42 sections, 9 equations, 18 figures, 2 tables, and 1 algorithm.

Figures (18)

  • Figure 1: Our method learns to disentangle overlaid logos from their supporting surfaces and recompose them seamlessly onto other objects. Each example shows two input photographs of objects with distinct logos. We first decompose each image into its logo and object layers, and then cross-compose the separated logos onto the other objects. These results demonstrate accurate separation and faithful re-integration across challenging non-linear cases involving complex geometry, lighting, and viewpoint changes.
  • Figure 2: Overview of the image decomposition framework. Given a composite image, the model receives a masked input, a binary mask indicating the logo region, and a noise latent, and predicts both the isolated logo and the clean object. The process is implemented by tuning a LoRA on top of Flux-Fill. The composition scheme is similar but uses a complementary mask to produce the composite image.
  • Figure 3: Illustration of our cycle-consistent training. The model jointly learns decomposition and composition, ensuring the decomposed layers recompose into the original image and vice versa.
  • Figure 4: Illustration of progressive data collection. (a) We first collect a seed dataset to obtain an IC-LoRA as the initial data generator. (b) In each round, we select high-quality samples generated by the current LoRA and reintroduce them to the training set. (c) During the training of the cycle model, we use it to produce consistent data by decomposing an image, and then re-composing it. High-quality recomposition samples are added back into the training set.
  • Figure 5: Qualitative comparison on challenging scenarios with synthetic data. The first column shows the inputs, while the following columns present results from our approach and four baselines: AssetDropper [li2025assetdropper], Flux-Kontext [labs2025flux], Gemini [comanici2025gemini], and ICEdit [zhang2025context]. The decomposed object layers appear at the bottom-right of each sample. For AssetDropper [li2025assetdropper], we use FLUX-Fill [flux2024] to inpaint the logo region as the object layer. Note that all the synthetic images are generated from a prompt, not composited from a logo and a clean object.
  • ...and 13 more figures
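
The progressive data collection described in the Figure 4 caption can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' pipeline: `progressive_rounds`, `train`, `generate`, `quality`, and `threshold` are hypothetical names, and the caption does not say how sample quality is judged, so the score and cutoff here are placeholders.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")  # a training sample (hypothetical type)
M = TypeVar("M")  # a trained model or LoRA (hypothetical type)


def progressive_rounds(
    seed: List[S],
    train: Callable[[List[S]], M],
    generate: Callable[[M, int], S],
    quality: Callable[[S], float],
    threshold: float,
    rounds: int,
    per_round: int,
) -> List[S]:
    """Grow a training set by keeping only high-scoring generated samples."""
    dataset = list(seed)  # copy so the caller's seed set is not mutated
    for _ in range(rounds):
        model = train(dataset)  # retrain on everything collected so far
        candidates = [generate(model, i) for i in range(per_round)]
        # Assumed filter: keep candidates whose score reaches the cutoff.
        dataset.extend(c for c in candidates if quality(c) >= threshold)
    return dataset


# Deterministic stand-ins so the loop can be run end to end.
result = progressive_rounds(
    seed=[0, 2],
    train=lambda data: len(data),            # "model" is just the set size
    generate=lambda model, i: model + i,     # "sample" is an integer
    quality=lambda s: 1.0 if s % 2 == 0 else 0.0,
    threshold=0.5,
    rounds=2,
    per_round=3,
)
```

With these stand-ins, each round retrains on the enlarged set before generating again, which mirrors the reintroduction step in panel (b) of the caption. The same filter-and-extend pattern could in principle be applied to the recomposed samples mentioned in panel (c).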