Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation

Tianfu Wang, Mingyang Xie, Haoming Cai, Sachin Shah, Christopher A. Metzler

TL;DR

Glass and other transparent surfaces create reflections that degrade images. Flash-Split introduces a two-stage latent-diffusion framework that uses misaligned flash/no-flash cues to separate transmission and reflection in latent space, mitigating alignment sensitivity. Stage 1 performs recursive latent separation with a dual-branch diffusion network conditioned on a flash/no-flash latent pair, while Stage 2 uses cross-latent decoding guided by the original input to recover faithful, high-frequency details. Evaluations on real-world scenes show state-of-the-art performance and robustness to misalignment, including scenarios without RAW input, highlighting practical applicability.

Abstract

Transparent surfaces, such as glass, create complex reflections that obscure images and challenge downstream computer vision applications. We introduce Flash-Split, a robust framework for separating transmitted and reflected light using a single (potentially misaligned) pair of flash/no-flash images. Our core idea is to perform reflection separation in latent space while leveraging the flash cues. Specifically, Flash-Split consists of two stages. Stage 1 separates the reflection and transmission latents via a dual-branch diffusion model conditioned on an encoded flash/no-flash latent pair, effectively mitigating the flash/no-flash misalignment issue. Stage 2 restores high-resolution, faithful details to the separated latents via a cross-latent decoding process conditioned on the original images before separation. By validating Flash-Split on challenging real-world scenes, we demonstrate state-of-the-art reflection separation performance, significantly outperforming the baseline methods.

Paper Structure (25 sections, 5 equations, 20 figures, 1 table)

This paper contains 25 sections, 5 equations, 20 figures, 1 table.

Figures (20)

  • Figure 1: Left: We separate the transmitted and reflected scenes by capturing one image with camera flash and another without flash, even though the two may be misaligned due to hand shake. Right: Our proposed Flash-Split method achieves a precise separation of the transmission and the reflection, performing much better than the baseline lei2023tpami.
  • Figure 2: Conventional Flash/No-Flash Methods Need Perfectly Paired Captures. The camera flash increases the brightness of the transmitted scene without affecting that of the reflected scene. Therefore, the difference between the two images is the transmitted scene, free of reflection. Top Right: If we capture a perfectly aligned pair of flash/no-flash images using a tripod plus wireless shutter control, the difference is a perfect transmission image. Bottom Left: If we use a tripod but press the shutter button with a finger, this slight motion misaligns the two shots, leading to noticeable artifacts in the difference image. Bottom Right: If we shoot handheld, the difference image exhibits even stronger artifacts. Takeaway: This misalignment issue has been the key barrier to applying flash/no-flash photography, an accessible method with great potential, to the task of reflection removal. In our work, we propose a robust approach to circumvent this key barrier.
  • Figure 3: Aligning Flash/No-Flash Images Is A Difficult Task for Image Registration Methods. While the difference between a misaligned flash/no-flash image pair (a,b) exhibits severe artifacts (c), aligning them is a non-trivial problem, since camera flash modifies the appearance of the transmitted component of one of the two images. Existing registration methods, like homography (d) fischler1981random or optical flow prediction (e) Sun2018PWC-Net used in lei2023tpami, fail to align this pair of images well; their aligned flash/no-flash pairs still suffer from severe artifacts. In contrast, our method (f) circumvents the misalignment issue by directly encoding the flash/no-flash pair into the latent space to perform recursive latent separation, eventually yielding a clean transmission scene.
  • Figure 4: Comparing Different 2D Reflection Removal Paradigms. (a): Software-only methods pass a single composite image (with both transmission and reflection) to a deep neural net for reflection separation. (b): Conventional flash/no-flash methods take the difference of a flash/no-flash image pair to get the transmission image Agrawal2005flash_vanilla; optionally, one can also use a neural net chang2020siamese to predict the reflection image and further refine the transmission image quality (omitted in the figure for simplicity). In cases of misalignment (when not using a tripod), lei2023tpami uses an optical flow module to pre-align the image pair. (c): Our proposed method moves the flash/no-flash approach into the latent space: we first encode the flash/no-flash image pair into a flash/no-flash latent pair, then use its physical cue to separate the composite scene's latent into a transmission latent and a reflection latent, and finally decode them back to RGB image space to obtain the clean transmission image and reflection image.
  • Figure 5: Our Proposed Pipeline consists of a latent separation stage and a decoding stage. Left: We first encode the misaligned flash/no-flash image pair into a flash/no-flash latent pair. We then use a dual-branch attention UNet with cross-attention in between to perform latent separation: the goal is to predict one latent for the transmission scene and another for the reflection scene. Following recent developments in latent diffusion models rombach2021latentdiffusion, ke2023marigold, at each inference step we concatenate both the flash/no-flash latents with random Gaussian noise and let the dual-branch UNet denoise them. Eventually, the top and bottom branches predict a transmission and a reflection latent, respectively. Right: We observe that the vanilla decoding process may lead to hallucination and blurriness (Figure \ref{fig:12_decoder}). To fix this issue, we apply a cross-latent decoding process with a UNet unet architecture. But unlike a normal UNet, we do not feed the encoder's output into the decoder. Instead, we (1) feed the original unseparated image into the encoder and (2) feed our separated latent (from the first stage) directly into the decoder. The encoder passes information to the decoder only through the skip connection layers. This decoding process combines two complementary sources of information: the predicted latent from Stage 1, which is separated but lacks high-frequency information, and the captured image, which is unseparated but contains high-frequency details. Together they lead to a faithful reconstruction of the original transmission/reflection scenes.
  • ...and 15 more figures
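
The flash/no-flash difference cue described in the Figure 2 caption can be illustrated with a toy example. This is a minimal sketch of the conventional difference baseline, not the Flash-Split method itself. It assumes a simple additive image model in which the no-flash capture is the ambient transmission plus the reflection, and the flash capture adds a flash-lit transmission term on top. The helper names (`random_image`, `combine`, `shift_columns`, `mean_abs_error`), the 8x8 integer images, and the one-pixel cyclic shift standing in for hand shake are all hypothetical choices made for illustration.

```python
import random


def random_image(height, width, max_value, seed):
    """Return a random integer grayscale image as a list of rows."""
    rng = random.Random(seed)
    return [[rng.randint(0, max_value) for _ in range(width)] for _ in range(height)]


def combine(a, b, sign=1):
    """Pixel-wise a + sign * b for two images of equal size."""
    return [[pa + sign * pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def shift_columns(image, pixels):
    """Cyclically shift every row to the right (a toy stand-in for hand shake)."""
    pixels %= len(image[0])
    return [row[-pixels:] + row[:-pixels] for row in image]


def mean_abs_error(a, b):
    """Mean absolute pixel difference between two images of equal size."""
    total = sum(abs(pa - pb) for row_a, row_b in zip(a, b)
                for pa, pb in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


HEIGHT, WIDTH = 8, 8
transmission_ambient = random_image(HEIGHT, WIDTH, 80, seed=0)  # scene behind the glass
reflection = random_image(HEIGHT, WIDTH, 80, seed=1)            # scene mirrored by the glass
transmission_flash = random_image(HEIGHT, WIDTH, 80, seed=2)    # extra light added by the flash

# Assumed additive model: the flash brightens only the transmitted scene.
no_flash = combine(transmission_ambient, reflection)
flash = combine(no_flash, transmission_flash)

# Aligned pair: the difference recovers the flash-lit transmission exactly.
aligned_diff = combine(flash, no_flash, sign=-1)
aligned_error = mean_abs_error(aligned_diff, transmission_flash)

# Misaligned pair: a one-pixel shift leaves residue from both layers.
misaligned_diff = combine(flash, shift_columns(no_flash, 1), sign=-1)
misaligned_error = mean_abs_error(misaligned_diff, transmission_flash)

print("aligned error:", aligned_error)
print("misaligned error:", misaligned_error)
```

In this toy model the aligned difference cancels the reflection and ambient terms completely, while even a one-pixel shift leaves a nonzero residue. This mirrors the qualitative behaviour that the captions attribute to tripod versus handheld captures.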