Reflection Removal through Efficient Adaptation of Diffusion Transformers

Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai

TL;DR

This paper addresses single-image reflection removal by repurposing a pre-trained diffusion transformer (DiT) with LoRA adapters for one-step latent-space editing. It introduces a physically based rendering (PBR) data generation pipeline to synthesize realistic glass reflections, paired with a two-stream latent flow-matching approach that reconstructs the transmission layer quickly and at high fidelity without multi-step sampling. The method achieves state-of-the-art performance on in-domain and zero-shot benchmarks and generalizes robustly to in-the-wild images, while remaining trainable on a single consumer GPU. The work suggests that diffusion-transformer priors, when combined with physically grounded data and lightweight adaptation, provide a scalable framework for reflection removal and related computational photography tasks, with potential extensions to video and more complex glass scenarios.
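To illustrate what "without multi-step sampling" means, here is a minimal sketch of flow-matching inference on toy vectors. It is not the paper's model: the latents `z_input` and `z_target` and the oracle `straight_velocity` are hypothetical stand-ins for a learned velocity field. It only shows that when the path between two latents is a straight line, a single Euler step reaches the same endpoint as many small steps.

```python
def euler_integrate(z0, velocity, num_steps):
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed-size Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Hypothetical toy latents: contaminated input and clean transmission target.
z_input = [0.8, -0.2, 0.5]
z_target = [0.6, -0.1, 0.1]


def straight_velocity(z, t):
    """Oracle velocity of the straight path z_t = (1 - t) * z_input + t * z_target.

    A trained network would predict this from z (and the conditioning image);
    here it is hard-coded so the example stays self-contained.
    """
    return [b - a for a, b in zip(z_input, z_target)]


one_step = euler_integrate(z_input, straight_velocity, num_steps=1)
many_steps = euler_integrate(z_input, straight_velocity, num_steps=50)

max_gap = max(abs(a - b) for a, b in zip(one_step, many_steps))
print("one step :", one_step)
print("50 steps :", many_steps)
print("max gap  :", max_gap)
```

With a constant velocity the step count does not change the endpoint, which is the intuition behind replacing an iterative sampler with a single network evaluation.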

Abstract

We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: https://hf.co/spaces/huawei-bayerlab/windowseat-reflection-removal-web
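The efficiency of LoRA-based adaptation can be made concrete with a small sketch. It uses the standard LoRA formulation, in which a frozen weight W is augmented with a low-rank update (alpha / r) * B A and B starts at zero. The layer size, rank, and alpha below are illustrative assumptions, not values reported by the authors.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: d_out x d_in frozen weight, b: d_out x r, a: r x d_in (both trainable).
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_trainable_params(d_in, d_out, r):
    """Number of trainable parameters in the two low-rank factors."""
    return r * (d_in + d_out)


# Tiny example: 2x3 frozen weight, rank-1 adapter with B initialized to zero.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
a = [[0.1, -0.2, 0.3]]   # r x d_in
b = [[0.0], [0.0]]       # d_out x r, zero init so training starts from W
w_start = lora_effective_weight(w, a, b, alpha=1.0)

# Hypothetical layer size and rank, chosen only to show the scale of the saving.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, r=16)
print("adapter / full parameters:", adapter / full)
```

Because B is zero at initialization, the adapted layer initially reproduces the pre-trained layer exactly, and only the small factors receive gradients during fine-tuning.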

Paper Structure

This paper contains 12 sections, 3 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: We present WindowSeat, a model and fine-tuning protocol for one-step reflection removal. It repurposes a foundation image diffusion transformer (DiT) into a state-of-the-art computational photography tool, enabled by an efficient and scalable Physically Based Rendering (PBR) pipeline for data synthesis. WindowSeat demonstrates stronger scene understanding and source-separation capabilities than competing methods, yielding cleaner outputs with fewer artifacts. For the visualization above, the "Scenario", "Ground truth" transmission, and reflection layers were generated; "Photos with reflection" were produced by our proposed PBR pipeline; result images were obtained from the respective methods. Best viewed zoomed in; arrows point at artifacts of methods; contrast enhanced for visualization.
  • Figure 2: Physically Based Rendering (PBR) pipeline for synthetic data generation. Left: The synthesis begins by sampling the foreground and background images, which can be in sRGB or HDR formats. The images are placed into a static 3D scene with a glass plate positioned in front of a virtual camera. The camera parameters and object distances are chosen to cover the view frustum of the virtual camera along transmission and reflection paths. Middle: At the heart of our pipeline is the Principled BSDF shading model [Burley2012PhysicallyBasedSA, 2015ExtendingTD], implemented in Blender [blender], which enables simulation of a wide range of photorealistic glass effects and light interactions. Right: Visualizations of three factors of variation. Index of Refraction (IoR) affects reflection strength. Thickness increases ghosting, which appears as a larger gap between the multiple reflections (arrows). Roughness controls the degree of scatter and blur. Such a simulation cannot be faithfully reproduced by screen-space alpha blending models. Details in Sec. \ref{sec:alpha_blending} and \ref{sec:data_synthesis}. Best viewed zoomed in.
  • Figure 3: Model architecture. Foundation DiTs [peebles2023scalable] operate in a compressed latent space in the bottleneck of a VAE [Kingma2014]. Fine-tuning DiTs can be done efficiently with lightweight LoRA [hu2022lora] adapters. Modern DiTs [batifol2025flux] with more than 10B parameters often employ quantized representations, such as QLoRA [qlora, liu2025fluxqlora]. The end-to-end fine-tuning procedure is elaborated in Sec. \ref{sec:one_step_fm_rr}.
  • Figure 4: Ablations. Left: PBR vs Alpha Blending; Right: Latent vs Flow objectives (Sec. \ref{sec:ablation}). PBR data and the Flow objective produce more accurate results. Best viewed zoomed in.
  • Figure 5: Qualitative comparison. Each column shows one Input-GT pair with the corresponding predictions from WindowSeat and SotA methods. WindowSeat detects and removes the reflection in the first two examples, while other methods leave the reflections unaltered. Columns 3-5 visualize the improved reflection removal capabilities of WindowSeat, leaving fewer artifacts in the predictions. Best viewed on screen and zoomed in.
  • ...and 3 more figures
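
The Figure 2 caption states that the Index of Refraction affects reflection strength. As a rough illustration of why, the sketch below evaluates the textbook Fresnel equations for unpolarized light entering a dielectric from air at a single interface. This is an assumption-laden simplification and not the Principled BSDF implementation used in the pipeline: it ignores glass thickness, multiple internal bounces (ghosting), and roughness.

```python
import math


def fresnel_unpolarized(cos_i, ior):
    """Reflectance of unpolarized light going from air (n = 1) into a dielectric.

    cos_i: cosine of the angle of incidence, in [0, 1].
    ior:   index of refraction of the dielectric, assumed > 1.
    """
    # Snell's law gives the transmitted angle: sin_t = sin_i / ior.
    sin_t2 = (1.0 - cos_i * cos_i) / (ior * ior)
    cos_t = math.sqrt(1.0 - sin_t2)
    # Amplitude reflection coefficients for s- and p-polarized light.
    r_s = (cos_i - ior * cos_t) / (cos_i + ior * cos_t)
    r_p = (ior * cos_i - cos_t) / (ior * cos_i + cos_t)
    # Unpolarized light is the average of both polarizations.
    return 0.5 * (r_s * r_s + r_p * r_p)


# Reflectance at normal incidence for a few illustrative IoR values.
for ior in (1.3, 1.5, 1.7):
    print(f"IoR {ior}: normal-incidence reflectance {fresnel_unpolarized(1.0, ior):.4f}")

# Reflectance rises toward 1 at grazing angles, whatever the IoR.
for cos_i in (1.0, 0.5, 0.1):
    print(f"cos(theta) {cos_i}: reflectance at IoR 1.5 = {fresnel_unpolarized(cos_i, 1.5):.4f}")
```

At normal incidence the formula reduces to ((n - 1) / (n + 1))^2, about 4% for an IoR of 1.5, and it grows with the IoR, so higher-IoR glass reflects more of the scene on the camera side. A fixed screen-space blending weight cannot capture this dependence on angle and material.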