Empowering Semantic-Sensitive Underwater Image Enhancement with VLM

Guodong Fan, Shengning Zhou, Genji Yuan, Huiyu Li, Jingchun Zhou, Jinjiang Li

Abstract

In recent years, learning-based underwater image enhancement (UIE) techniques have evolved rapidly. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. Concretely, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism that combines cross-attention with an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that, when applied to different UIE baselines, our strategy significantly boosts their performance on perceptual quality metrics and improves their performance on detection and segmentation tasks, validating its effectiveness and adaptability.
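
The dual-guidance idea described above can be illustrated with a small NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `cross_attention` and `alignment_loss` are hypothetical, the attention is plain scaled dot-product attention with a residual connection, and the alignment term is taken to be one minus the cosine similarity between a feature saliency map and the guidance map. It should not be read as the authors' implementation.

```python
import numpy as np


def cross_attention(img_feats, guide_feats):
    """Scaled dot-product cross-attention (illustrative, not the paper's layer).

    img_feats:   (N, d) array, one row per image location (queries).
    guide_feats: (M, d) array, one row per guidance token (keys and values).
    Returns the fused features (N, d) and the attention weights (N, M).
    """
    d = img_feats.shape[-1]
    scores = img_feats @ guide_feats.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over guidance tokens
    fused = img_feats + weights @ guide_feats  # residual fusion (assumed)
    return fused, weights


def alignment_loss(saliency, guidance_map, eps=1e-8):
    """Assumed alignment term: 1 - cosine similarity of two spatial maps."""
    a = saliency.ravel()
    b = guidance_map.ravel()
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - cos


rng = np.random.default_rng(0)
H, W, d, M = 4, 4, 8, 3
img_feats = rng.normal(size=(H * W, d))      # stand-in for UIE encoder features
guide_feats = rng.normal(size=(M, d))        # stand-in for embedded guidance
guidance_map = rng.uniform(size=(H, W))      # stand-in for the semantic guidance map

fused, weights = cross_attention(img_feats, guide_feats)
saliency = np.linalg.norm(fused, axis=-1).reshape(H, W)  # per-location feature energy
loss = alignment_loss(saliency, guidance_map)
print(fused.shape, weights.shape, float(loss))
```

In a training setup this alignment term would presumably be added, with some weight, to the usual reconstruction loss, so that feature energy concentrates where the guidance map is high; the weighting and the choice of saliency measure are left open here because the abstract does not specify them.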

Paper Structure

This paper contains 32 sections, 5 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Impact of our semantic-sensitive strategy on downstream tasks. The radar charts on the left show consistent quantitative improvements in semantic segmentation and object detection when baseline models are empowered by our -SS strategy (in red). The qualitative examples on the right illustrate these benefits: our method yields segmentation results closer to the ground truth (top) and enables more confident and accurate object detection (bottom).
  • Figure 2: Overview of our semantic-sensitive learning strategy. A VLM-generated guidance map steers the UIE network via a dual-guidance approach: cross-attention and explicit supervision via an alignment loss.
  • Figure 3: Visual comparison of enhancement results. Our -SS models produce images with better color fidelity and sharper details on key objects on the UIEB dataset (top). In challenging U45 scenes (bottom), our method effectively restores natural colors while avoiding artifacts introduced by baseline models.
  • Figure 4: Visual comparison on the object detection task. Our -SS enhancement significantly improves the detection of small, low-contrast objects in both murky water body (top) and complex seabed (bottom) environments, effectively mitigating the missed detection issue prevalent in baseline methods.
  • Figure 5: Visual comparison on the semantic segmentation task. Our semantic-sensitive enhancement preserves object boundaries and reduces background confusion in both dark (top row) and light (bottom row) scenes, leading to more accurate segmentation masks compared to baseline methods.
  • ...and 2 more figures