FIRM: Flexible Interactive Reflection reMoval

Xiao Chen; Xudong Jiang; Yunkang Tao; Zhen Lei; Qing Li; Chenyang Lei; Zhaoxiang Zhang

TL;DR

This work tackles the ill-posed problem of single-image reflection removal by introducing FIRM, a flexible interactive framework that accepts diverse user guidance forms and converts them into contrastive masks via a dedicated UGC module. A Segmentation Any Reflection Model (SARM) enables visual and text prompts to produce accurate reflection/transmission masks, which are then fused with the blended image through a Contrastive Guidance Interaction Block (CGIB) built on cross-attention to achieve precise layer separation with a lightweight CNN backbone. The method delivers state-of-the-art results on Real20 and SIR2 while drastically reducing annotation time (from tens of seconds to a few seconds per image) and supports a new interactive reflection removal dataset with four guidance modalities. Overall, FIRM enhances practicality and performance for real-world reflection removal by unifying guidance forms and enabling efficient, accurate segmentation-guided decomposition.

Abstract

Removing reflections from a single image is challenging due to the absence of general reflection priors. Although existing methods incorporate extensive user guidance to reach satisfactory performance, they often lack the flexibility to adapt to user guidance in different modalities, and dense user interactions further limit their practicality. To alleviate these problems, this paper presents FIRM, a novel framework for Flexible Interactive image Reflection reMoval with various forms of guidance, where users can provide sparse visual guidance (e.g., points, boxes, or strokes) or text descriptions for better reflection removal. First, we design a novel user guidance conversion module (UGC) to transform different forms of guidance into unified contrastive masks. The contrastive masks provide explicit cues for identifying reflection and transmission layers in blended images. Second, we devise a contrastive mask-guided reflection removal network that comprises a newly proposed contrastive guidance interaction block (CGIB). This block leverages a unique cross-attention mechanism that merges contrastive masks with image features, allowing for precise layer separation. The proposed framework requires only 10% of the guidance time needed by previous interactive methods, which marks a step change in flexibility. Extensive results on public real-world reflection removal datasets show that our method achieves state-of-the-art reflection removal performance. Code is available at https://github.com/ShawnChenn/FlexibleReflectionRemoval.
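
The cross-attention idea behind CGIB can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention in which image features act as queries and mask-derived tokens act as keys and values. The function names and toy data are hypothetical, and the abstract does not specify the actual block layout.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: N image-feature vectors of dimension d (assumed role)
    keys, values: M guidance vectors derived from the masks (assumed role)
    Returns N fused vectors, each a convex combination of `values`.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Similarity of this image feature to every guidance token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted mix of guidance values, one output channel at a time.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy data: three pixel features and two mask tokens
# (one for the reflection mask, one for the transmission mask).
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mask_tokens = [[4.0, 0.0], [0.0, 4.0]]
fused = cross_attention(image_feats, mask_tokens, mask_tokens)
```

In this toy setup, a pixel feature aligned with the first token draws most of its output from that token, while the ambiguous third feature mixes both tokens equally.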

Paper Structure (13 sections, 5 equations, 5 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Method
Overview
UGC: User Guidance Conversion
Contrastive Mask-Guided Reflection Removal
Experiments
Dataset
Implementation Details
Evaluations on Reflection Removal
Ablation Study
Conclusion
Acknowledgments

Figures (5)

Figure 1: Comparison between previous interactive methods [levin2007user, Zhang2020FastUS] and ours. (a) and (b) illustrate the structural differences. The previous methods are guidance-specific, with a tailored reflection removal network ($\mathcal{R}_{\text{point}}$, $\mathcal{R}_{\text{stroke}}$) for each guidance form (e.g., point or stroke). In contrast, our framework is flexible, using a conversion module to accommodate various forms of guidance by transforming them into a unified "segmentation mask". (c) We also compare the time cost of providing user guidance: our method requires significantly less time per image than reported in previous work [Zhang2020FastUS].
Figure 2: Illustration of our proposed pipeline, FIRM. FIRM receives a blended image together with diverse forms of user guidance, such as visual guidance or text descriptions. The user guidance conversion module (UGC) first uses the user guidance to transform the raw input into contrastive masks. Then, the contrastive mask-guided network, built with our Contrastive Guidance Interaction Blocks (CGIB), uses the contrastive masks to separate the transmission and reflection layers from the blended input. (Detailed network configurations are provided in the supplementary materials.)
Figure 3: Illustration of the training pipeline of SARM. We introduce a learnable degradation-invariant token and a feature selection block into the original SAM architecture, aiming for accurate mask prediction in blended images. To maintain the zero-shot capability of SAM [kirillov2023SAM], only a limited number of parameters in the mask decoder are trainable, while the parameters of the image encoder and prompt encoder from the pre-trained SAM remain fixed.
Figure 4: Qualitative comparison of estimated transmissions between representative single-image-based methods and ours on the Real20 and SIR2 datasets. Single-image-based methods struggle to remove sharp reflections. Our approach achieves much better reflection removal than the baselines with very sparse point guidance on reflection and transmission areas.
Figure 5: Qualitative comparison of predicted transmissions between state-of-the-art interactive methods and ours on the SIR2 dataset [wan2017benchmarking]. The guidance for reflection and transmission regions is labeled with different colors. Our approach achieves superior reflection removal using just 2 sparse points.
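
The guidance-to-mask conversion shown in Figure 2 can be pictured with a deliberately simplified sketch. In the paper this step is performed by the UGC module with a SAM-based model (SARM). The code below is only a stand-in that rasterizes a box or a point directly into a pair of binary masks. All function names, the disk radius, and the rule for resolving overlaps are assumptions made for illustration.

```python
def box_to_mask(height, width, box):
    """Rasterize an inclusive (x0, y0, x1, y1) box into a binary mask."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0
             for x in range(width)] for y in range(height)]


def point_to_mask(height, width, point, radius=1):
    """Rasterize a point as a filled disk (the radius is an assumed parameter)."""
    px, py = point
    return [[1 if (x - px) ** 2 + (y - py) ** 2 <= radius ** 2 else 0
             for x in range(width)] for y in range(height)]


def to_contrastive_masks(height, width, reflection, transmission):
    """Turn one guidance item per layer into a (reflection, transmission) pair.

    Each item is ("box", (x0, y0, x1, y1)) or ("point", (x, y)).
    """
    def convert(item):
        kind, geometry = item
        if kind == "box":
            return box_to_mask(height, width, geometry)
        if kind == "point":
            return point_to_mask(height, width, geometry)
        raise ValueError(f"unsupported guidance kind: {kind}")

    r_mask = convert(reflection)
    t_mask = convert(transmission)
    # Assumed design choice: keep the two masks disjoint by clearing
    # any pixel that both guidance items claim.
    for y in range(height):
        for x in range(width):
            if r_mask[y][x] and t_mask[y][x]:
                r_mask[y][x] = t_mask[y][x] = 0
    return r_mask, t_mask


# Hypothetical guidance: a box on the reflection, a point on the transmission.
r_mask, t_mask = to_contrastive_masks(
    6, 6, ("box", (0, 0, 2, 2)), ("point", (4, 4))
)
```

A real conversion module would predict object-shaped masks from such prompts rather than filling simple shapes. The sketch only shows how different guidance forms can share one output format.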
