UWFormer: Underwater Image Enhancement via a Semi-Supervised Multi-Scale Transformer

Weiwen Chen, Yingtie Lei, Shenghong Luo, Ziyang Zhou, Mingxian Li, Chi-Man Pun

TL;DR

UWFormer tackles underwater image enhancement under limited paired data by introducing a semi-supervised, multi-scale Transformer. It splits frequency content with lossless DWT/IDWT and uses a Low-frequency MSFormer and a High-frequency Residual path, enhanced by Nonlinear Frequency-Aware Attention and a Multi-Scale Fusion Feed-forward Network. A Subaqueous Perceptual Loss guides pseudo-label generation in a teacher-student setup, enabling effective learning from unlabeled data. Comprehensive experiments on full-reference and no-reference benchmarks show superior visual quality and quantitative performance compared to state-of-the-art methods, highlighting the practical impact of frequency-aware, multi-scale design for underwater imaging.
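The "lossless" property of the DWT/IDWT split can be checked with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes a single-level Haar wavelet (the summary does not name the wavelet), uses NumPy rather than a deep learning framework, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical. In this decimated form each sub-band has half the spatial size of the input, and sub-band naming (LH vs. HL) varies between libraries.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an array shaped (C, H, W); H and W must be even."""
    a = x[:, 0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[:, 0::2, 1::2]  # top-right
    c = x[:, 1::2, 0::2]  # bottom-left
    d = x[:, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a - b + c - d) / 2.0  # difference across columns
    hl = (a + b - c - d) / 2.0  # difference across rows
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the (C, 2h, 2w) array from four sub-bands."""
    ch, h, w = ll.shape
    x = np.empty((ch, 2 * h, 2 * w), dtype=ll.dtype)
    x[:, 0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[:, 0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[:, 1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[:, 1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

# Toy RGB image; random values stand in for real pixel data.
rng = np.random.default_rng(0)
image = rng.random((3, 8, 8))

ll, lh, hl, hh = haar_dwt2(image)
# Stack the three high-frequency sub-bands along the channel axis (3 x 3 = 9 channels).
high = np.concatenate([lh, hl, hh], axis=0)
reconstructed = haar_idwt2(ll, lh, hl, hh)
```

Because the Haar transform is orthogonal, `reconstructed` matches `image` up to floating-point error, which is what allows the low- and high-frequency branches to be processed separately and recombined without losing information in the split itself.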

Abstract

Underwater images often exhibit poor quality, distorted color balance, and low contrast due to the complex interplay of light, water, and objects. Despite the significant contributions of previous underwater enhancement techniques, several problems remain: (i) Current deep learning methods rely on Convolutional Neural Networks (CNNs), which lack multi-scale enhancement and have a limited global receptive field. (ii) Paired real-world underwater datasets are scarce, and relying on synthetic image pairs can lead to overfitting. To address these problems, this paper introduces a multi-scale Transformer-based network called UWFormer for enhancing images at multiple frequencies via semi-supervised learning, in which we propose a Nonlinear Frequency-aware Attention mechanism and a Multi-Scale Fusion Feed-forward Network for low-frequency enhancement. In addition, we introduce a dedicated underwater semi-supervised training strategy, in which we propose a Subaqueous Perceptual Loss function to generate reliable pseudo labels. Experiments on full-reference and non-reference underwater benchmarks demonstrate that our method outperforms state-of-the-art methods both quantitatively and in visual quality.

Paper Structure (18 sections, 3 equations, 5 figures, 2 tables)


Figures (5)

  • Figure 1: Comparison between (a) an underwater image and (f) its enhanced version obtained from the EUVP dataset. The results of (b), (c) traditional enhancement methods and (d) a deep learning model still show imperfections. By contrast, (e) our method effectively enhances the image.
  • Figure 2: Proposed underwater semi-supervised training strategy. Labeled images $X_l \in \mathbb{R}^{C \times H \times W}$ and unlabeled images $X_u \in \mathbb{R}^{C \times H \times W}$ are fed to: (i) the student model for training, and (ii) the teacher model for unlabeled prediction. The student model outputs labeled results $\hat{X_l}$ and unlabeled results $\hat{X_u}$, with $\hat{X_l}$ constrained by a supervised loss function. Note that the outputs of the teacher model are evaluated and updated in real time using the proposed Subaqueous Perceptual Loss (SPL) to generate pseudo labels $X_p$. Finally, the unlabeled outputs of the student model $\hat{X_u}$ are constrained by $X_p$ using an unsupervised loss function. In this way, the weights of the student model are determined by both labeled and unlabeled data.
  • Figure 3: Overall architecture of our proposed UWFormer. An input image $X \in \mathbb{R}^{C \times H \times W}$ is first processed by DWT to produce three high-frequency images $\left\{ X_{LH}, X_{HL}, X_{HH} \right\} \in \mathbb{R}^{C \times H \times W}$ and one low-frequency image $X_{LL} \in \mathbb{R}^{C \times H \times W}$. The three high-frequency images $X_{LH}, X_{HL}, X_{HH}$ are then merged into a 9-channel image $X_{H}$ and fed into a simple ResNet composed of FFC. The low-frequency part $X_{LL}$ is fed into our MSFormer. Note that $X_{LL}$ at different scales undergoes feature extraction and is then fed into the MSFormer in a top-down way, where it is fused with the output of the previous encoder. Finally, the output of the MSFormer $\hat{X_{LL}}$ and the optimized image $\hat{X_H}$ are passed through IDWT to obtain the final result.
  • Figure 4: A visual comparison of different image enhancement methods. Methods (b) and (c) show varying degrees of over-optimization, whereas methods (d) and (g) exhibit different degrees of under-optimization. Method (f) shows relatively satisfactory optimization results, yet some optimization defects can still be observed, such as the appearance of white blocks in the middle of the last row. However, our results (h) show superior visual results in terms of color tone and details, and are the closest to the target. The first two rows of images are sourced from EUVP, while the latter two rows are obtained from UIEB.
  • Figure 5: Visual comparison of different image enhancement methods on no-reference datasets. Methods (b), (c), and (e) exhibit considerable over-enhancement, while methods (d) and (g) do not fully enhance the images. Method (f) shows relatively satisfactory results, albeit with ambiguous optimization in the first row and purple blotches in the third row. Our method (h) provides superior visual results in terms of color tone and details. The rows, from top to bottom, originate from UIEB-60, U45, RUIE, and EUVPUN, respectively.
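
The pseudo-label step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's training code: `score_fn` is a hypothetical stand-in for the Subaqueous Perceptual Loss (lower taken to mean better), the per-image "keep the best teacher output so far" rule is one plausible reading of "evaluated and updated in real time", and the L1 losses and the weight `weight_u` are assumptions, since the caption does not specify the loss forms.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(a - b)))

def update_pseudo_label(bank, key, teacher_out, score_fn):
    """Keep the best-scoring teacher output seen so far for one unlabeled image.

    bank maps an image key to (score, pseudo_label); lower score is better.
    score_fn is a hypothetical placeholder for the Subaqueous Perceptual Loss.
    """
    score = score_fn(teacher_out)
    if key not in bank or score < bank[key][0]:
        bank[key] = (score, teacher_out.copy())
    return bank[key][1]

def student_loss(student_l, target_l, student_u, pseudo_u, weight_u=1.0):
    """Supervised term on labeled data plus weighted unsupervised term (assumed L1)."""
    return l1(student_l, target_l) + weight_u * l1(student_u, pseudo_u)

# Toy scoring rule (assumption): prefer images whose mean intensity is near 0.5.
def toy_score(img):
    return abs(float(img.mean()) - 0.5)

bank = {}
shape = (3, 4, 4)
# Three successive teacher predictions for the same unlabeled image.
p1 = update_pseudo_label(bank, "img0", np.full(shape, 0.9), toy_score)
p2 = update_pseudo_label(bank, "img0", np.full(shape, 0.6), toy_score)  # better, replaces
p3 = update_pseudo_label(bank, "img0", np.full(shape, 0.8), toy_score)  # worse, ignored

loss = student_loss(
    student_l=np.zeros(shape), target_l=np.zeros(shape),
    student_u=np.full(shape, 0.5), pseudo_u=p3, weight_u=0.5,
)
```

In this toy run the bank ends up holding the second prediction, so a later, lower-quality teacher output does not overwrite a better pseudo label. The student is then pulled toward the stored pseudo label on unlabeled data while still fitting the ground truth on labeled data.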