Towards Generating Realistic Underwater Images
Abdul-Kazeem Shamba
TL;DR
This work tackles the generation of realistic underwater imagery from synthetic scenes with uniform lighting by comparing paired and unpaired image translation methods, including a variant that integrates depth information into a contrastive learning framework. On the VAROS dataset, pix2pix is the strongest paired method for sharp, high-frequency detail, while autoencoders preserve structural similarity but produce blurrier outputs. Among unpaired methods, CycleGAN achieves a competitive FID, CUT improves structural fidelity via patchwise contrastive losses, and incorporating depth into CUT yields the lowest FID, albeit with a slight SSIM drop. The results highlight the practical trade-off between perceptual realism and content preservation, with depth cues and contrastive objectives offering promising gains for realistic underwater data generation in marine robotics applications.
Abstract
This paper explores the use of contrastive learning and generative adversarial networks for translating synthetic images with uniform lighting into realistic underwater images. We compare image translation models on the VAROS dataset using two evaluation metrics, Fréchet Inception Distance (FID) and Structural Similarity Index Measure (SSIM), which together expose the trade-off between perceptual quality and structural preservation. For paired image translation, pix2pix achieves the best FID score owing to its paired supervision and PatchGAN discriminator, while the autoencoder attains the highest SSIM, suggesting better structural fidelity despite its blurrier outputs. Among unpaired methods, CycleGAN achieves a competitive FID score by leveraging a cycle-consistency loss, whereas CUT, which replaces cycle consistency with contrastive learning, attains a higher SSIM, indicating that it retains spatial structure better. Notably, incorporating depth information into CUT yields the lowest overall FID score, demonstrating that depth cues enhance realism. However, the accompanying slight decrease in SSIM suggests that depth-aware learning may introduce structural variations.
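To make the two metrics concrete, the following is a minimal sketch of how they can be computed, not the evaluation code used in this work. It assumes features have already been extracted (FID proper uses Inception-v3 activations, which are not computed here), and it uses a simplified single-window SSIM, whereas the standard SSIM averages over local windows. The function names `frechet_distance` and `global_ssim` are hypothetical. Lower Fréchet distance indicates closer feature distributions, and higher SSIM indicates closer structure.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_features). For FID, the features
    would be Inception-v3 activations; here they are assumed to be given.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM computed over the whole image as a single window."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

Because the Fréchet distance compares feature distributions over whole image sets while SSIM compares pixel statistics of individual image pairs, a model can improve on one metric while regressing on the other, which is the trade-off reported above.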
