Language-guided Image Reflection Separation

Haofeng Zhong, Yuchen Hong, Shuchen Weng, Jinxiu Liang, Boxin Shi

TL;DR

This work tackles the ill-posed problem of reflection separation by introducing language guidance to provide semantic priors. It presents a cross-modal framework with adaptive global aggregation (AGAM) and interaction (AGIM) modules, gated language guidance, and a randomized training strategy to handle recognizable layer ambiguity, supported by contrastive and layer-correspondence losses. A dataset with synthetic and real image-language pairs is built to train and evaluate the approach, including a new RefOL real-world set. Empirical results on real data show state-of-the-art PSNR/SSIM and improved qualitative separation, demonstrating the practical potential of language-guided reflection separation for single-image scenarios.

Abstract

This paper studies the problem of language-guided reflection separation, which aims at addressing the ill-posed reflection separation problem by introducing language descriptions to provide layer content. We propose a unified framework to solve this problem, which leverages the cross-attention mechanism with contrastive learning strategies to construct the correspondence between language descriptions and image layers. A gated network design and a randomized training strategy are employed to tackle the recognizable layer ambiguity. The effectiveness of the proposed method is validated by the significant performance advantage over existing reflection separation methods on both quantitative and qualitative comparisons.
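The gated guidance and randomized training ideas described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cross_attention`, `gated_guidance`, `randomly_drop`), the scalar `gate`, the additive fusion, and the choice to drop either description with equal probability are all assumptions made for illustration. The sketch only shows how image tokens could attend to language tokens, how a gate could scale that guidance, and how a missing (null) description could be handled.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query (image token) attends to all keys (language tokens)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def gated_guidance(image_tokens, language_tokens, gate):
    """Add language-attended features to image tokens, scaled by `gate`.

    Assumption: a null description (None) means the image features pass through unchanged.
    """
    if language_tokens is None:
        return [list(tok) for tok in image_tokens]
    attended = cross_attention(image_tokens, language_tokens, language_tokens)
    return [[x + gate * a for x, a in zip(tok, att)]
            for tok, att in zip(image_tokens, attended)]


def randomly_drop(desc_a, desc_b, drop_prob, rng):
    """Hypothetical randomized training step: with probability `drop_prob`,
    replace one of the two layer descriptions with None."""
    if rng.random() < drop_prob:
        if rng.random() < 0.5:
            return None, desc_b
        return desc_a, None
    return desc_a, desc_b


if __name__ == "__main__":
    rng = random.Random(0)
    image_tokens = [[0.5, 0.5], [1.0, 0.0]]
    description = [[1.0, 2.0]]  # a single toy language token
    desc_a, desc_b = randomly_drop(description, description, drop_prob=0.5, rng=rng)
    print(gated_guidance(image_tokens, desc_a, gate=0.5))
    print(gated_guidance(image_tokens, desc_b, gate=0.5))
```

Training with randomly dropped descriptions exposes such a model to both the one-description and two-description cases, which is the situation the recognizable layer ambiguity creates at inference time.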

Paper Structure (15 sections, 5 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The recognizable layer ambiguity problem leads to an uncertain number of input language descriptions for language-guided image reflection separation. Given language descriptions of either (a) one layer or (b) both layers, the proposed method achieves more robust reflection separation than an existing reflection separation method [dong2021location].
  • Figure 2: The pipeline of the proposed language-guided image reflection separation framework. It extracts features from the mixture image and the available language descriptions with image and language encoders (Sec. \ref{subsec:extract}); the description $L_2$, drawn with dashed lines, may be set to null because of the recognizable layer ambiguity. It then aggregates global visual information with adaptive global aggregation modules (AGAM) and conducts progressive interactions with adaptive global interaction modules (AGIM), which exploit distinctive image features under gated language guidance (Sec. \ref{subsec:interact}). Finally, it recovers the image layers with image decoders (Sec. \ref{subsec:recover}).
  • Figure 3: The architecture of (a) the adaptive global aggregation module (AGAM) and (b) the adaptive global interaction module (AGIM), which aggregate global contextual information of visual features and rearrange feature channels with gated language guidance, respectively.
  • Figure 4: Qualitative comparison of estimated transmission and reflection layers on real data against state-of-the-art methods, including DSRNet [hu2023single], YTMT [hu2021ytmt], Dong et al. [dong2021location], and IBCLN [li2020single]. Please zoom in for details.
  • Figure 5: Results with different numbers of input language descriptions.