AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis
Basit Alawode, Iyyakutti Iyappan Ganapathi, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood
TL;DR
AquaticCLIP addresses the lack of robust underwater vision-language models by pre-training on a domain-specific dataset of 2M image-text pairs and introducing a dual-encoder architecture with a prompt-guided image encoder and a vision-guided text encoder. The model uses unsupervised description generation via MarineGPT, followed by a textual cleaning module, to create rich, context-aware descriptions that are aligned with visual content through a cross-modal contrastive loss. Zero-shot and fine-tuned evaluations across marine species, coral, and fish classification, as well as segmentation, detection, and counting tasks, show significant gains over state-of-the-art methods in aquatic settings and highlight the method's potential for scalable underwater scene analysis without task-specific labels. The work offers a practical path toward robust biodiversity monitoring and conservation support, delivering both a public dataset and a capable foundation model for underwater multimodal understanding.
Abstract
The preservation of aquatic biodiversity is critical in mitigating the effects of climate change. Aquatic scene understanding plays a pivotal role in aiding marine scientists in their decision-making processes. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model enriches existing vision-language models in the aquatic domain. For this purpose, we construct a dataset of 2 million underwater image-text pairs from heterogeneous sources, including YouTube, Netflix, and NatGeo. To fine-tune AquaticCLIP, we propose a prompt-guided vision encoder that progressively aggregates patch features via learnable prompts, while a vision-guided mechanism enhances the language encoder by incorporating visual context. The model is optimized through a contrastive pretraining loss to align visual and textual modalities. AquaticCLIP achieves notable performance improvements in zero-shot settings across multiple underwater computer vision tasks, outperforming existing methods in both robustness and interpretability. Our model sets a new benchmark for vision-language applications in underwater environments. The code and dataset for AquaticCLIP are publicly available on GitHub at xxx.
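To make the idea of a contrastive pretraining loss concrete, the following is a minimal sketch of the standard symmetric CLIP-style (InfoNCE) objective over a batch of paired image and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the NumPy formulation, and the temperature value of 0.07 are all hypothetical choices, and the abstract does not specify the exact form of the loss used in AquaticCLIP.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each embedding to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of N matched image-text pairs.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a matched pair.
    temperature: assumed fixed value for illustration (often learned in practice).
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)

    # logits[i, j] is the scaled similarity between image i and text j.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each image should pick out its own text among all texts.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each text should pick out its own image among all images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()

    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    aligned_texts = images.copy()                      # perfectly matched pairs
    mismatched_texts = np.roll(images, shift=1, axis=0)  # every pair is wrong

    print("aligned loss:   ", clip_contrastive_loss(images, aligned_texts))
    print("mismatched loss:", clip_contrastive_loss(images, mismatched_texts))
```

In this toy setup the loss is low when each image embedding is closest to its own text embedding and high when the pairing is shifted, which is the behavior that pulls matched underwater images and descriptions together during pretraining.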
