Rectifying Latent Space for Generative Single-Image Reflection Removal

Mingjia Li, Jin Hu, Hainuo Wang, Qiming Hu, Jiarui Wang, Xiaojie Guo

TL;DR

This work tackles the ill-posed problem of single-image reflection removal by identifying that latent spaces of pretrained encoders do not align with the linear superposition of background and reflection. It introduces GenSIRR, a diffusion-based pipeline with a reflection-equivariant VAE to restructure latent geometry, a Learnable Task Embedding for precise guidance, and a depth-guided early-branching sampling strategy to select high-quality restorations. Through a two-stage training regime and extensive benchmarks, GenSIRR achieves state-of-the-art results and demonstrates strong generalization to challenging real-world images, albeit with higher inference latency. The approach offers a practical path toward reliable, high-fidelity SIRR in the wild and highlights directions for acceleration and broader layer-separation tasks.

Abstract

Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions, causing them to fail at recovery and generalization in the wild. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs, yielding high-quality outputs. We argue that the challenge of this conversion stems from a critical yet overlooked issue, i.e., the latent space of semantic encoders lacks the inherent structure to interpret a composite image as a linear superposition of its constituent layers. Our approach is built on three synergistic components, including a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation, a learnable task-specific text embedding for precise guidance that bypasses ambiguous language, and a depth-guided early-branching sampling strategy to harness generative stochasticity for promising results. Extensive experiments reveal that our model achieves new SOTA performance on multiple benchmarks and generalizes well to challenging real-world cases.
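To make the "linear superposition" argument concrete, here is a minimal sketch of how one might measure whether an encoder respects additive layer composition. It assumes the simplest formation model, composite = background + reflection, and compares the latent of the composite with the sum of the layer latents. The function `equivariance_gap` and both toy encoders are hypothetical illustrations; this is not the loss or the VAE used in the paper.

```python
from typing import Callable, List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def equivariance_gap(
    encode: Callable[[Vector], Vector],
    background: Vector,
    reflection: Vector,
) -> float:
    """Mean squared difference between E(B + R) and E(B) + E(R).

    A value of zero means the encoder maps the (assumed) additive
    composite to the sum of the layer latents for this input pair.
    """
    composite = add(background, reflection)  # assumed formation model: I = B + R
    z_composite = encode(composite)
    z_sum = add(encode(background), encode(reflection))
    return sum((p - q) ** 2 for p, q in zip(z_composite, z_sum)) / len(z_sum)


def linear_encoder(x: Vector) -> Vector:
    """Toy linear map from 3 inputs to 2 latents (hypothetical)."""
    return [0.5 * x[0] + 0.25 * x[1], x[2] - x[0]]


def nonlinear_encoder(x: Vector) -> Vector:
    """Toy shifted ReLU, standing in for a generic nonlinear encoder (hypothetical)."""
    return [max(0.0, v - 0.5) for v in x]


background = [0.25, 0.5, 0.125]
reflection = [0.5, 0.25, 0.25]

gap_linear = equivariance_gap(linear_encoder, background, reflection)
gap_nonlinear = equivariance_gap(nonlinear_encoder, background, reflection)
print(gap_linear, gap_nonlinear)
```

The linear toy encoder yields a gap of zero, while the nonlinear one does not. A regularizer of this general shape would push a learned encoder toward the first behaviour, which is the intuition behind aligning the latent space with the physics of reflection formation.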

Paper Structure

This paper contains 16 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Upper row shows challenging real-world captures with strong reflections. Reflections interfere with the performance of downstream tasks, for instance, monocular depth estimation and zero-shot segmentation. Lower row reveals that our proposed GenSIRR generalizes consistently well in these challenging cases, producing plausible and accurate results.
  • Figure 2: The overall pipeline of our proposed method. In Stage I, a reflection-equivalence loss regularizes the latent space. In Stage II, the VAE encoder and decoder are frozen, the text encoder encodes the initial prompt to initialize the learnable task embedding, and the task embedding and the DiT model are trained. Stage III is an additional test-time scaling stage: by sampling with multiple seeds, our scoring model automatically selects a faithful candidate with minimal computational overhead.
  • Figure 3: One-step outputs (upper) and final results (lower).
  • Figure 4: Examples of input, ground truth, and the corresponding mask on the OpenRR-val dataset with a mask threshold of 5.
  • Figure 5: Qualitative comparisons on Nature (top row) and Real20 (bottom row) datasets. Please zoom in for more details.
  • ...and 2 more figures
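
The test-time selection mentioned in the Figure 2 caption can be sketched roughly as follows. This is one possible reading, not the paper's implementation: it assumes that each seed yields a cheap one-step preview (compare Figure 3), that a scoring function ranks those previews, and that only the best seed is carried through full sampling. The names `select_seed_early`, `toy_preview`, and `toy_score` are hypothetical, and the toy score is an arbitrary stand-in for the depth-guided scoring model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Preview = List[float]


def select_seed_early(
    seeds: Sequence[int],
    one_step_preview: Callable[[int], Preview],
    score: Callable[[Preview], float],
) -> Tuple[int, float]:
    """Score a cheap preview for every seed and return the best (seed, score).

    Only the returned seed would then be run through the full sampler, so the
    cost is roughly len(seeds) preview steps plus one complete sampling run.
    """
    scored = [(score(one_step_preview(seed)), seed) for seed in seeds]
    best_score, best_seed = max(scored)  # ties resolve to the larger seed
    return best_seed, best_score


def toy_preview(seed: int) -> Preview:
    """Hypothetical stand-in for a one-step output: seeded random values."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(8)]


def toy_score(preview: Preview) -> float:
    """Hypothetical stand-in for the scoring model; higher is better."""
    return -sum(abs(value - 0.5) for value in preview)


seeds = [0, 1, 2, 3]
best_seed, best_score = select_seed_early(seeds, toy_preview, toy_score)
print(best_seed, best_score)
```

Branching early in this way is what keeps the overhead small: the candidates are compared before most of the sampling cost has been paid.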