Retinex Meets Language: A Physics-Semantics-Guided Underwater Image Enhancement Network

Shixuan Xu, Yabo Liu, Junyu Dong, Xinghui Dong

TL;DR

To the authors' knowledge, this study is the first to introduce textual guidance and a multimodal data set into Underwater Image Enhancement (UIE); it also proposes an Image-Text Semantic Similarity (ITSS) loss function.

Abstract

Underwater images often suffer from severe degradation caused by light absorption and scattering, leading to color distortion, low contrast, and reduced visibility. Existing Underwater Image Enhancement (UIE) methods fall into two categories: prior-based and learning-based. The former rely on rigid physical assumptions that limit adaptability, while the latter often face data scarcity and weak generalization. To address these issues, we propose a Physics-Semantics-Guided Underwater Image Enhancement Network (PSG-UIENet), which couples Retinex-grounded illumination correction with language-informed guidance. This network comprises a Prior-Free Illumination Estimator, a Cross-Modal Text Aligner, and a Semantics-Guided Image Restorer. In particular, the restorer leverages the textual descriptions generated by the Contrastive Language-Image Pre-training (CLIP) model to inject high-level semantics for perceptually meaningful guidance. Since multimodal UIE data sets are not publicly available, we also construct a large-scale image-text UIE data set, LUIQD-TD, which contains 6,418 image-reference-text triplets. To explicitly measure and optimize semantic consistency between textual descriptions and images, we further design an Image-Text Semantic Similarity (ITSS) loss function. To our knowledge, this study makes the first effort to introduce both textual guidance and a multimodal data set into UIE tasks. Extensive experiments on our data set and four publicly available data sets demonstrate that the proposed PSG-UIENet achieves superior or comparable performance relative to fifteen state-of-the-art methods.
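
The abstract does not give the formula for the ITSS loss. As a rough illustration of what "measuring semantic consistency between textual descriptions and images" can look like, the sketch below computes one minus the cosine similarity between paired image and text embeddings and averages it over a batch. This is an assumed, generic formulation and not the paper's definition; the function names and the toy two-dimensional embeddings are hypothetical, and in practice the embeddings would come from an image-text encoder such as CLIP.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + eps)


def similarity_loss(image_embs, text_embs):
    """Mean of (1 - cosine similarity) over paired image/text embeddings.

    Assumed stand-in for an image-text consistency loss; NOT the ITSS
    formulation from the paper, which is not specified in the abstract.
    """
    if not image_embs or len(image_embs) != len(text_embs):
        raise ValueError("need the same non-zero number of image and text embeddings")
    losses = [1.0 - cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): aligned pairs give a loss near 0,
# orthogonal pairs give a loss of 1.
aligned = similarity_loss([[1.0, 0.0]], [[2.0, 0.0]])
orthogonal = similarity_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

Under this assumed form, the loss falls as the enhanced image's embedding moves closer in direction to the embedding of its caption, which is the general behavior a semantic-consistency term is meant to encourage.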

Paper Structure (31 sections, 19 equations, 10 figures, 5 tables, 1 algorithm)

Figures (10)

  • Figure 1: Comparison of three Retinex-based UIE methods: Retinexformer, RetinexMamba, and our PSG-UIENet. For each image, the PSNR, SSIM, and LPIPS values computed against the corresponding reference image are shown in the top-left corner. These results highlight the effectiveness of Retinex theory in alleviating underwater image degradation.
  • Figure 2: Representative samples from the LUIQD-TD data set. Each example consists of three components: the left image shows the raw underwater input, the right image presents the reference image with the highest perceptual quality score, and the accompanying caption provides the textual description summarizing the scene and visual attributes. The data set spans diverse underwater scenarios, including coral reefs, marine life, divers, submerged wrecks, and underwater vehicles, thereby offering rich semantic and visual information for multimodal UIE.
  • Figure 3: Statistical analysis of the LUIQD-TD textual annotations, including (a) word-frequency patterns, (b) the distribution of caption lengths, and (c) the distribution of VQA-based image–text consistency scores, demonstrating the semantic quality and reliability of the annotations.
  • Figure 4: The overall architecture of the proposed PSG-UIENet, which comprises three modules: (a) a Prior-Free Illumination Estimator that generates multi-scale light-enhanced representations, (b) a Cross-Modal Text Aligner that establishes semantic correspondence between text and image, and (c) a Semantics-Guided Image Restorer that performs multimodal fusion and enhancement using a dual-branch structure.
  • Figure 5: Architecture of the Semantics-Guided Encoder-Decoder Network. The symmetric encoder–decoder uses Transformer-Conv modules for joint local-global feature extraction. Cross-modal attention and a Cross-Attention FiLM Module (CFM) integrate textual semantics, enabling progressive image reconstruction with semantic and visual fusion.
  • ...and 5 more figures
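
Figure 1 reports PSNR between each enhanced image and its reference. As a reminder of how that score is obtained, the sketch below implements the standard definition, PSNR = 10 · log10(MAX² / MSE), on flattened pixel sequences. The function name, the flat-list input format, and the 8-bit default peak value of 255 are assumptions made for illustration; the paper's evaluation code is not shown here.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    `reference` and `enhanced` are equal-length sequences of pixel values;
    `max_value` is the largest possible pixel value (255 for 8-bit images).
    """
    if not reference or len(reference) != len(enhanced):
        raise ValueError("images must be non-empty and have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no error, unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by 10 gray levels, so MSE = 100.
score = psnr([100, 100, 100, 100], [110, 90, 110, 90])
```

Higher PSNR means the enhanced image is closer to the reference pixel by pixel. SSIM and LPIPS, the other two metrics in Figure 1, capture structural and learned perceptual similarity and are not covered by this sketch.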