AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis

Basit Alawode, Iyyakutti Iyappan Ganapathi, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood

TL;DR

AquaticCLIP addresses the lack of robust underwater vision-language models by pre-training on a large, domain-specific 2M image-text dataset and introducing a dual-encoder architecture with a prompt-guided image encoder and a vision-guided text encoder. The model leverages unsupervised description generation via MarineGPT and a textual cleaning module to create rich, context-aware descriptions that align with visual content through a cross-modal contrastive loss. Extensive zero-shot and fine-tuned evaluations across marine species, coral, fish, segmentation, detection, and counting tasks demonstrate significant gains over SOTA methods in aquatic settings and highlight the method's potential for scalable, zero-shot underwater scene analysis. The work offers a practical path toward robust biodiversity monitoring and conservation support, delivering both a public dataset and a capable foundation model for underwater multimodal understanding.

Abstract

The preservation of aquatic biodiversity is critical in mitigating the effects of climate change. Aquatic scene understanding plays a pivotal role in aiding marine scientists in their decision-making processes. In this paper, we introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments, enabling tasks such as segmentation, classification, detection, and object counting. By leveraging our large-scale underwater image-text paired dataset without the need for ground-truth annotations, our model enriches existing vision-language models in the aquatic domain. For this purpose, we construct a dataset of 2 million underwater image-text pairs from heterogeneous sources, including YouTube, Netflix, NatGeo, and others. To fine-tune AquaticCLIP, we propose a prompt-guided vision encoder that progressively aggregates patch features via learnable prompts, while a vision-guided mechanism enhances the language encoder by incorporating visual context. The model is optimized through a contrastive pre-training loss to align visual and textual modalities. AquaticCLIP achieves notable performance improvements in zero-shot settings across multiple underwater computer vision tasks, outperforming existing methods in both robustness and interpretability. Our model sets a new benchmark for vision-language applications in underwater environments. The code and dataset for AquaticCLIP are publicly available on GitHub at xxx.
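
To make the contrastive alignment of visual and textual modalities concrete, the following is a minimal sketch of a symmetric, CLIP-style contrastive loss in plain Python. It is not the authors' implementation: the function names, the temperature value of 0.07, and the equal weighting of the image-to-text and text-to-image terms are assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (the epsilon guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1e-12
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_feats is assumed to be the match for row i of text_feats.
    The temperature default is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text term: each image should select its own description.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image term: each description should select its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: correctly paired embeddings versus the same texts in swapped order.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

In this toy batch, `aligned` comes out lower than `shuffled`, which is the behavior the pre-training objective relies on: matching image-text pairs are pulled together and mismatched pairs are pushed apart.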

Paper Structure

This paper contains 49 sections, 7 equations, 18 figures, 13 tables.

Figures (18)

  • Figure 1: (a) Step 1: Two Million Aquatic Image-Text Pairs. The dataset consists of paired aquatic images and enriched textual descriptions, which serve as input to the model. (b) Step 2: Contrastive Loss Pretraining. Text and image pairs are processed by a text encoder and an image encoder. The embeddings are aligned through contrastive loss, reducing the distance between matching pairs and improving the model's ability to associate images with their corresponding textual descriptions. (c) Step 3: Downstream Analysis. AquaticCLIP performance is evaluated across various tasks such as zero-shot marine species classification, fine-tuned instance and semantic segmentation, object detection, and biodiversity counting in underwater imagery.
  • Figure 2: Overview of AquaticCLIP architecture and training process. (a) Shows a set of input image-text pairs. (b) A caption model (MarineGPT) generates textual descriptions for the images. (c) Input images are divided into patches and processed by the image encoder $\Phi_v$ to produce patch embeddings $\textbf{P}_{i}$. (d) The generated textual descriptions $\textbf{S}_{i}$ are processed by the text encoder $\Phi_{t}$ to produce text embeddings. (e)-(f) The textual description $\textbf{S}_{i}$ is then cleaned by an image-text caption cleaning module to produce refined descriptions $\hat{\textbf{S}}_{i}$, which are then combined with ground-truth descriptions $\textbf{G}_{i}$ to produce enriched textual description data $\textbf{C}_{i}$. Both image and text embeddings are refined using (h) vision-guided text encoding and (g) prompt-guided vision encoding. The learned prompts $\textbf{E}_i$ guide the fusion of patch embeddings, while initialized prompts $\textbf{Q}_{i}$ are used to enhance the visual representation. (i) The final image and text features are aligned using a cross-modal contrastive pre-training loss $\mathcal{L}_{cont}$, ensuring a stronger association between text and image representations.
  • Figure 3: (a) Prompt-Guided Vision Encoder: The prompt-guided attention mechanism combines patch features $\textbf{P}_{i}$ with initialized prompts $\textbf{Q}_{i}$ through layer normalization and an MLP, followed by softmax to produce the final image features $\textbf{f}_{i}$. (b) Vision-Guided Text Encoder: Text embeddings $\textbf{T}_{i}$ are refined using a vision-guided attention mechanism, where patch features $\textbf{P}_{i}$, learned prompts $\textbf{E}_{i}$, and text embeddings $\textbf{T}_{i}$ are concatenated to compute attention $\textbf{U}_{i}$, which further enhances $\textbf{T}_{i}$.
  • Figure 4: Exemplary image-text pairs from our 2 million aquatic image-text paired dataset.
  • Figure 5: Exemplary image-text pairs from our 2 million aquatic image-text paired dataset.
  • ...and 13 more figures
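
The prompt-guided aggregation described in the caption of Figure 3(a) can be approximated by attention pooling, in which each prompt acts as a query over the patch embeddings. The sketch below is a simplified illustration rather than the paper's encoder: it omits the layer normalization and MLP mentioned in the caption, and the function names, the scaled dot-product scoring, and the final averaging over prompts are all assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_guided_pool(patches, prompts):
    """Aggregate patch embeddings into one image feature using prompt queries.

    patches: list of N patch vectors of dimension d
    prompts: list of K prompt vectors of dimension d (learnable in a real model)
    """
    d = len(patches[0])
    scale = 1.0 / math.sqrt(d)
    pooled = []
    for q in prompts:
        # Attention of this prompt over all patches: scaled dot product, then softmax.
        scores = [scale * sum(a * b for a, b in zip(q, p)) for p in patches]
        weights = softmax(scores)
        # Weighted sum of patches gives this prompt's summary of the image.
        pooled.append(
            [sum(w * p[k] for w, p in zip(weights, patches)) for k in range(d)]
        )
    # Average the K prompt summaries into a single image feature (an assumed choice).
    return [sum(vec[k] for vec in pooled) / len(pooled) for k in range(d)]


# A zero prompt attends uniformly, so the result is the mean of the patches.
uniform = prompt_guided_pool([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0]])
# A prompt aligned with the first patch concentrates nearly all weight on it.
focused = prompt_guided_pool([[1.0, 0.0], [0.0, 1.0]], [[50.0, 0.0]])
```

The two calls show the range of behavior: an uninformative prompt reduces to mean pooling, while a prompt that matches a particular patch selects that patch almost exclusively.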