Overcoming Annotation Bottlenecks in Underwater Fish Segmentation: A Robust Self-Supervised Learning Approach
Alzayat Saleh, Marcus Sheaves, Dean Jerry, Mostafa Rahimi Azghadi
TL;DR
This work tackles the annotation bottleneck in underwater fish segmentation by introducing a self-supervised, Transformer-based framework that learns robust representations from unlabeled video. The method employs a two-branch architecture with cross-view consistency and space-time self-training, supported by anchor sampling and label propagation, enabling effective segmentation across challenging underwater datasets. Trained on DeepFish and evaluated on Seagrass and YouTube-VOS, it outperforms existing self-supervised baselines and approaches fully supervised performance without annotations, while offering computational efficiency for potential edge deployment. Limitations include occlusion handling under seagrass and generalization to more diverse habitats; future work could integrate additional modalities and extend to species-level identification and other underwater objects.
Abstract
Accurate fish segmentation in underwater videos is challenging due to low visibility, variable lighting, and dynamic backgrounds, and the manual annotation required by fully supervised methods makes them impractical for many applications. This paper introduces a novel self-supervised deep learning approach for fish segmentation. Our model, trained without manual annotation, learns robust and generalizable representations by aligning features across augmented views and enforcing spatial-temporal consistency. We demonstrate its effectiveness on three challenging underwater video datasets: DeepFish, Seagrass, and YouTube-VOS. It surpasses existing self-supervised methods and achieves segmentation accuracy comparable to fully supervised methods without the need for costly annotations. Trained only on DeepFish, our model generalizes well, achieving high segmentation accuracy on the unseen Seagrass and YouTube-VOS datasets. Furthermore, our model is computationally efficient due to its parallel processing and efficient anchor sampling technique, making it suitable for real-time applications and potential deployment on edge devices. We present quantitative results using the Jaccard Index and Dice coefficient, as well as qualitative comparisons, showcasing the accuracy, robustness, and efficiency of our approach for advancing underwater video analysis.
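For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how the Jaccard Index (intersection over union) and the Dice coefficient are commonly computed for binary segmentation masks. It is illustrative only and is not the authors' evaluation code: the function names `jaccard_index` and `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are all assumptions made for this example.

```python
import numpy as np


def jaccard_index(pred, target):
    """Jaccard Index (IoU) between two binary masks: |A ∩ B| / |A ∪ B|."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


def dice_coefficient(pred, target):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / total)


# Toy 2x3 masks (hypothetical data): 1 marks a "fish" pixel.
pred_mask = np.array([[1, 1, 0],
                      [0, 1, 0]])
true_mask = np.array([[1, 0, 0],
                      [0, 1, 1]])

j = jaccard_index(pred_mask, true_mask)
d = dice_coefficient(pred_mask, true_mask)
print(f"Jaccard: {j:.3f}, Dice: {d:.3f}")
```

For a single pair of masks the two scores are related by D = 2J / (1 + J), so they rank predictions identically; Dice weights the overlap more heavily and is therefore never lower than Jaccard.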
