Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM
Pingping Zhang, Tianyu Yan, Yang Liu, Huchuan Lu
TL;DR
This work addresses Marine Animal Segmentation (MAS) by adapting the Segment Anything Model (SAM) to underwater domains. It introduces a Dual-SAM Encoder to inject marine priors via gamma-corrected imagery and adapters, and pairs it with Multi-level Coupled Prompts, a Dilated Fusion Attention Module, and Criss-Cross Connectivity Prediction to capture structured connectivity beyond pixel-wise masks. Pseudo-label Mutual Supervision enables mutual refinement between dual decoders, yielding consistent, state-of-the-art MAS performance across five datasets. The approach demonstrates strong transferability and robustness to zero-shot scenarios, highlighting SAM's potential when domain-specific priors and decoding strategies are embedded for underwater perception.
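The gamma-corrected imagery mentioned above can be illustrated with a minimal NumPy sketch. It assumes an 8-bit input image, and the function name `gamma_correct` and the default `gamma=0.5` are hypothetical choices for illustration only. The summary does not state the value or the exact preprocessing used.

```python
import numpy as np


def gamma_correct(image, gamma=0.5):
    """Apply power-law (gamma) correction to an 8-bit image.

    `gamma` is an illustrative assumption, not a value taken from the paper.
    With gamma < 1, dark regions are brightened, which is one common way to
    compensate for low-light underwater scenes.
    """
    # Scale to [0, 1] so the power law is well defined.
    normalized = np.asarray(image, dtype=np.float64) / 255.0
    corrected = np.power(normalized, gamma)
    # Map back to the 8-bit range, rounding to the nearest integer.
    return np.rint(corrected * 255.0).astype(np.uint8)
```

In a dual-branch setup of the kind described, one would presumably feed the original image to one branch and `gamma_correct(image)` to the other; that wiring is an assumption here, not a detail confirmed by the text.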
Abstract
As an important pillar of underwater intelligence, Marine Animal Segmentation (MAS) involves segmenting animals within marine environments. Previous methods struggle to extract long-range contextual features and overlook the connectivity between discrete pixels. Recently, the Segment Anything Model (SAM) has offered a universal framework for general segmentation tasks. However, because it is trained on natural images, SAM lacks prior knowledge of marine images. In addition, SAM's single-position prompt is insufficient for prior guidance. To address these issues, we propose a novel feature learning framework, named Dual-SAM, for high-performance MAS. We first introduce a dual structure that follows SAM's paradigm to enhance feature learning on marine images. Then, we propose a Multi-level Coupled Prompt (MCP) strategy to inject comprehensive underwater prior information, and enhance the multi-level features of SAM's encoder with adapters. Subsequently, we design a Dilated Fusion Attention Module (DFAM) to progressively integrate multi-level features from SAM's encoder. Finally, instead of directly predicting the masks of marine animals, we propose a Criss-Cross Connectivity Prediction (C$^3$P) paradigm to capture the inter-connectivity between discrete pixels. With dual decoders, the framework generates pseudo-labels and achieves mutual supervision for complementary feature representations, resulting in considerable improvements over previous techniques. Extensive experiments verify that our proposed method achieves state-of-the-art performance on five widely used MAS datasets. The code is available at https://github.com/Drchip61/Dual_SAM.
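To make the idea of predicting connectivity rather than a plain mask more concrete, the sketch below turns a binary mask into per-direction connectivity maps. It is only one plausible reading of "criss-cross" connectivity: it assumes the four horizontal and vertical neighbours and marks a pixel as connected in a direction when both it and that neighbour are foreground. The function name `criss_cross_connectivity`, the direction order, and the neighbourhood choice are all assumptions. The abstract does not define the exact formulation, so the released code should be treated as the reference.

```python
import numpy as np

# Assumed neighbour offsets (row, column) for the four criss-cross directions.
OFFSETS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def criss_cross_connectivity(mask):
    """Return a (4, H, W) boolean array of connectivity maps.

    Channel k is True at (i, j) when pixel (i, j) and its neighbour in the
    k-th direction are both foreground. Neighbours outside the image are
    treated as background.
    """
    foreground = np.asarray(mask).astype(bool)
    height, width = foreground.shape
    # Pad with background so that border pixels have well-defined neighbours.
    padded = np.pad(foreground, 1, constant_values=False)
    maps = []
    for dy, dx in OFFSETS.values():
        # Slice of the padded mask aligned so that entry (i, j) holds the
        # value of the neighbour at (i + dy, j + dx).
        neighbour = padded[1 + dy : 1 + dy + height, 1 + dx : 1 + dx + width]
        maps.append(foreground & neighbour)
    return np.stack(maps, axis=0)
```

Under this reading, a connectivity target encodes local structure that a per-pixel mask does not: two adjacent foreground pixels reinforce each other, while an isolated foreground pixel has no positive connectivity in any direction.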
