Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter

Zhiyang Chen, Chen Zhang, Hao Fang, Runmin Cong

TL;DR

DiveSeg tackles underwater instance segmentation by fine-tuning a powerful foundation model, DINOv2, through two lightweight adapters: AquaStyle Aligner, which captures and injects underwater color style via Fourier amplitude and cross-attention, and ObjectPrior Prompter, which provides object-level priors using binary masks to guide instance learning. The framework yields state-of-the-art performance on UIIS and USIS10K, with substantial gains in mAP, AP50, and AP75 while keeping a modest parameter footprint. Ablation confirms the complementary benefits of both modules, and qualitative results show sharper boundaries and better handling of cluttered underwater scenes. Overall, DiveSeg demonstrates the practicality of foundation-model-based UIS with targeted, efficient domain adaptation for marine exploration and ecological protection.
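
As a rough illustration of the Fourier-amplitude idea mentioned above, the sketch below builds an amplitude-only reconstruction of an image with NumPy. It is not the authors' implementation: the function name `amplitude_style_image` and the choice to discard the phase entirely (zero phase) are assumptions made for illustration. They rest on the common observation that the amplitude spectrum mostly carries color and appearance statistics, while the phase carries scene structure.

```python
import numpy as np


def amplitude_style_image(image: np.ndarray) -> np.ndarray:
    """Return an amplitude-only reconstruction of an H x W x C image.

    Hypothetical helper for illustration: the phase is dropped (set to zero),
    so spatial structure is removed and only amplitude statistics remain.
    """
    # 2D Fourier transform over the spatial axes, applied per channel.
    spectrum = np.fft.fft2(image, axes=(0, 1))
    # Keep the amplitude and discard the phase.
    amplitude = np.abs(spectrum)
    # Inverse transform of the zero-phase spectrum. For a real input the
    # amplitude is even-symmetric, so the result is real up to rounding.
    return np.fft.ifft2(amplitude, axes=(0, 1)).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    underwater_like = rng.random((32, 32, 3))  # stand-in for a real image
    style = amplitude_style_image(underwater_like)
    print(style.shape)
```

The output has the same amplitude spectrum as the input but no phase information. How DiveSeg actually turns such a signal into a style vector and injects it through cross-attention is specified in the paper and the linked repository, not here.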

Abstract

Underwater instance segmentation (UIS), which integrates pixel-level understanding and instance-level discrimination, is a pivotal technology in marine resource exploration and ecological protection. In recent years, large-scale pretrained visual foundation models, exemplified by DINO, have advanced rapidly and demonstrated remarkable performance on complex downstream tasks. In this paper, we demonstrate that DINO can serve as an effective feature learner for UIS, and we introduce DiveSeg, a novel framework built upon two components: (1) the AquaStyle Aligner, which embeds underwater color style features into the DINO fine-tuning process to facilitate adaptation to the underwater domain; and (2) the ObjectPrior Prompter, which incorporates binary segmentation-based prompts to deliver object-level priors, providing essential guidance for the instance segmentation task, which requires both object- and instance-level reasoning. We conduct thorough experiments on the popular UIIS and USIS10K datasets, and the results show that DiveSeg achieves state-of-the-art performance. Code: https://github.com/ettof/Diveseg.

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) A typical underwater imaging system: direct transmission carries useful scene information, forward scattering causes blurring, and backscattered light reduces visibility. We also present representative underwater images from UIS datasets. (b) Visual comparisons among Watermask (CNN-based method), USIS-SAM (SAM-based method), and Ours (DINO-based method).
  • Figure 2: PCA visualization of DINOv2 and DiveSeg features on natural and underwater images. The background is removed by thresholding the first PCA component.
  • Figure 3: The overall framework of the proposed DiveSeg. First, a Style Extraction module obtains an underwater style vector. This vector is then injected into the frozen DINOv2 backbone via the Style Injection module, enabling rapid adaptation to the underwater domain. Together, these two modules constitute the AquaStyle Aligner. In addition, the ObjectPrior Prompter leverages binary masks to learn object-level priors, which guide the network to focus on underwater objects and ease the challenge of directly segmenting specific instances.
  • Figure 4: Underwater images and the corresponding style images. FT and iFT denote the Fourier transform and the inverse Fourier transform, respectively.
  • Figure 5: Qualitative comparisons of DiveSeg with SOTA UIS methods on the USIS10K and UIIS datasets.
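
The background removal described in the Figure 2 caption (thresholding the first PCA component) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' visualization code: the helper names `pca_project` and `foreground_pca_rgb`, the threshold of 0, and the random array standing in for DINOv2 patch features are all hypothetical. Note that the sign of a principal component is arbitrary, so which side of the threshold corresponds to foreground has to be checked on real features.

```python
import numpy as np


def pca_project(features: np.ndarray, n_components: int) -> np.ndarray:
    """Project N x D features onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def foreground_pca_rgb(patch_features: np.ndarray, threshold: float = 0.0):
    """Mask background patches with the first component, then color the rest.

    Assumes the feature dimension D is at least 3. Returns a boolean
    foreground mask of shape (N,) and an (N, 3) array of values in [0, 1].
    """
    first = pca_project(patch_features, 1)[:, 0]
    foreground = first > threshold  # assumed sign convention
    rgb = np.zeros((patch_features.shape[0], 3))
    if foreground.sum() >= 3:
        # Second PCA on foreground patches only; 3 components map to RGB.
        comps = pca_project(patch_features[foreground], 3)
        lo, hi = comps.min(axis=0), comps.max(axis=0)
        rgb[foreground] = (comps - lo) / (hi - lo + 1e-8)
    return foreground, rgb


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_patches = rng.normal(size=(256, 32))  # stand-in for 16 x 16 patch tokens
    mask, colors = foreground_pca_rgb(fake_patches)
    print(mask.shape, colors.shape)
```

For a real image, the per-patch colors would be reshaped to the patch grid (for example 16 x 16 x 3 here) and upsampled to the image size for display.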