Image-Conditional Diffusion Transformer for Underwater Image Enhancement

Xingyang Nie, Su Pan, Xiaoyu Zhai, Shifei Tao, Fengzhong Qu, Biao Wang, Huilin Ge, Guojie Xiao

TL;DR

This paper tackles underwater image enhancement (UIE) with a latent diffusion model conditioned on the degraded input image. It introduces the Image-Conditional Diffusion Transformer (ICDT), which replaces the conventional U-Net backbone with a transformer operating in latent space and is trained with a hybrid loss that includes learned variances, which speeds up sampling. Experiments on the Underwater ImageNet dataset show that larger ICDT models perform better, and that the largest, ICDT-XL/2, achieves state-of-the-art results on the full-reference metrics PSNR, SSIM, and LPIPS and on the no-reference metric UIQM. The work demonstrates that ICDT scales well and positions it as a potentially general approach for image-to-image generation tasks beyond UIE.

Abstract

Underwater image enhancement (UIE) has attracted much attention owing to its importance for underwater operation and marine engineering. Motivated by recent advances in generative models, we propose a novel UIE method based on an image-conditional diffusion transformer (ICDT). Our method takes the degraded underwater image as the conditional input and converts it into latent space, where ICDT is applied. ICDT replaces the conventional U-Net backbone in a denoising diffusion probabilistic model (DDPM) with a transformer, and thus inherits favorable properties of transformers such as scalability. Furthermore, we train ICDT with a hybrid loss function involving variances to achieve better log-likelihoods, which also significantly accelerates the sampling process. We experimentally assess the scalability of ICDTs and compare them with prior work in UIE on the Underwater ImageNet dataset. Besides showing good scaling properties, our largest model, ICDT-XL/2, outperforms all comparison methods, achieving state-of-the-art (SOTA) image enhancement quality.
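
The abstract does not spell out the hybrid loss. The sketch below assumes it follows the formulation commonly used for DDPMs with learned variances (Nichol and Dhariwal, "Improved Denoising Diffusion Probabilistic Models"): a noise-prediction MSE term plus a small weight on a variational (KL) term, with the variance interpolated in log space between the two fixed bounds. The function names, the linear beta schedule, and the weight 1e-3 are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule (the original DDPM default), not confirmed for ICDT.
    return np.linspace(beta_start, beta_end, T)

def log_variance_bounds(betas):
    """Return (log beta_tilde_t, log beta_t), the lower and upper variance bounds."""
    alphas_cumprod = np.cumprod(1.0 - betas)
    alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])
    beta_tilde = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)
    # beta_tilde[0] is exactly 0, so reuse the second entry to keep the log finite.
    log_beta_tilde = np.log(np.append(beta_tilde[1], beta_tilde[1:]))
    return log_beta_tilde, np.log(betas)

def interpolate_log_variance(v, log_beta_tilde_t, log_beta_t):
    # v in [0, 1] is the extra per-dimension model output:
    # log Sigma = v * log beta_t + (1 - v) * log beta_tilde_t
    return v * log_beta_t + (1.0 - v) * log_beta_tilde_t

def normal_kl(mean1, logvar1, mean2, logvar2):
    """Elementwise KL(N(mean1, var1) || N(mean2, var2)) for diagonal Gaussians."""
    return 0.5 * (-1.0 + logvar2 - logvar1
                  + np.exp(logvar1 - logvar2)
                  + (mean1 - mean2) ** 2 * np.exp(-logvar2))

def hybrid_loss(eps_true, eps_pred, kl_term, lam=1e-3):
    # L_hybrid = L_simple + lam * L_vlb. In a real training loop the predicted
    # mean inside kl_term is detached so that L_vlb only trains the variance.
    l_simple = np.mean((eps_true - eps_pred) ** 2)
    return l_simple + lam * np.mean(kl_term)
```

Learning the variance this way is what allows sampling with far fewer steps than the number used in training with little loss in quality, which is consistent with the sampling speed-up the abstract reports.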

Paper Structure

This paper contains 17 sections, 10 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: The forward noising process (dashed line) and the reverse denoising process (solid line) of an image-conditional DDPM.
  • Figure 2: The ICDT architecture. Left: We train ICDT models in latent space. The converted latent is divided into patches and subsequently processed by $N$ diffusion transformer blocks. Right: Details of the diffusion transformer blocks which incorporate conditioning through adaptive layer norm.
  • Figure 3: Scaling the ICDT model improves PSNR during the whole training process. We present PSNR over training steps for all 12 of our ICDT models. In the top row, we compare PSNR against model size while keeping patch size constant. In the bottom row, we compare PSNR against patch size while holding model size constant. Scaling the transformer backbone yields better generative models across all model sizes and patch sizes.
  • Figure 4: PSNR is strongly correlated with model FLOPs. We show PSNR of each ICDT model after 90K training iterations and each model’s FLOPs.
  • Figure 5: Larger ICDT models are more compute-efficient. We present PSNR against total training compute.
  • ...and 2 more figures
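
The two operations named in the Figure 2 caption, splitting the latent into patches and conditioning through adaptive layer norm, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the latent shape (4, 32, 32), the shift-and-scale form of the modulation (borrowed from the original Diffusion Transformer design), and all function names are assumptions. In a real block, `shift` and `scale` would be regressed from a conditioning embedding by a small network.

```python
import numpy as np

def patchify(latent, p):
    """Split a (C, H, W) latent into (H/p * W/p) tokens of length p*p*C."""
    c, h, w = latent.shape
    assert h % p == 0 and w % p == 0, "patch size must divide H and W"
    x = latent.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)          # (H/p, W/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

def layer_norm(x, eps=1e-6):
    # Layer norm over the feature axis, without learned affine parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(tokens, shift, scale):
    # Adaptive layer norm: the condition supplies the shift and scale
    # that a standard layer norm would otherwise learn as fixed parameters.
    return layer_norm(tokens) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))   # hypothetical latent shape
tokens = patchify(latent, p=2)              # patch size 2, as in the "/2" suffix
out = adaln_modulate(tokens, shift=np.zeros(16), scale=np.zeros(16))
```

Under these assumptions, halving the patch size quadruples the number of tokens (64 tokens at p=4, 256 at p=2), which is why smaller patches cost more FLOPs at a fixed model size.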