Overcoming Annotation Bottlenecks in Underwater Fish Segmentation: A Robust Self-Supervised Learning Approach

Alzayat Saleh; Marcus Sheaves; Dean Jerry; Mostafa Rahimi Azghadi

TL;DR

This work tackles the annotation bottleneck in underwater fish segmentation by introducing a self-supervised, Transformer-based framework that learns robust representations from unlabeled video. The method employs a two-branch architecture with cross-view consistency and space-time self-training, supported by anchor sampling and label propagation, enabling effective segmentation across challenging underwater datasets. Trained on DeepFish and evaluated on Seagrass and YouTube-VOS, it outperforms existing self-supervised baselines and approaches fully supervised performance without annotations, while offering computational efficiency for potential edge deployment. Limitations include occlusion handling under seagrass and generalization to more diverse habitats; future work could integrate additional modalities and extend to species-level identification and other underwater objects.

Abstract

Accurate fish segmentation in underwater videos is challenging due to low visibility, variable lighting, and dynamic backgrounds, making fully supervised methods that require manual annotation impractical for many applications. This paper introduces a novel self-supervised deep learning approach for fish segmentation. Our model, trained without manual annotation, learns robust and generalizable representations by aligning features across augmented views and enforcing spatio-temporal consistency. We demonstrate its effectiveness on three challenging underwater video datasets: DeepFish, Seagrass, and YouTube-VOS, surpassing existing self-supervised methods and achieving segmentation accuracy comparable to fully supervised methods without the need for costly annotations. Trained on DeepFish, our model exhibits strong generalization, achieving high segmentation accuracy on the unseen Seagrass and YouTube-VOS datasets. Furthermore, our model is computationally efficient due to its parallel processing and efficient anchor sampling technique, making it suitable for real-time applications and potential deployment on edge devices. We present quantitative results using the Jaccard index and Dice coefficient, as well as qualitative comparisons, showcasing the accuracy, robustness, and efficiency of our approach for advancing underwater video analysis.
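
The two reported metrics can be illustrated with a minimal NumPy sketch for binary masks. This is a generic illustration of the standard definitions, not the authors' evaluation code; the function names and the convention of returning 1.0 when both masks are empty are assumptions made here.

```python
import numpy as np


def jaccard_index(pred, target):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def dice_coefficient(pred, target):
    """Twice the intersection divided by the sum of the two mask areas."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / total)


# Toy 2x3 masks: 2 overlapping pixels, 4 pixels in the union, 3 pixels each.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])

j = jaccard_index(pred, target)      # 2 / 4
d = dice_coefficient(pred, target)   # 2 * 2 / (3 + 3)
```

For a single pair of masks the two scores are related by Dice = 2J / (1 + J), so they rank predictions identically; Dice is simply more forgiving of partial overlap.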

Paper Structure (26 sections, 5 equations, 7 figures, 3 tables, 3 algorithms)

This paper contains 26 sections, 5 equations, 7 figures, 3 tables, 3 algorithms.

Introduction
Method
Model Architecture
Co-Scale Conv-Attention
Multilayer Perceptron (MLP)
Regularizing Branch
Anchor Sampling
Loss Function
Cross-View Consistency
Space-Time Self-Training
Final Loss
Label Propagation
Experiments
Datasets
Data Augmentation
...and 11 more sections

Figures (7)

Figure 1: The natural visual artefact dynamics provide important cues about the composition of scenes and how they change.
Figure 2: Our proposed framework consists of a single feature extractor that processes video sequences. Given a batch of unlabeled video sequences $x$, two batches of different views $v$ and $\hat{v}$ are produced and are then encoded into embeddings $y$ and $\hat{y}$ through the main branch $f_{\theta}$ and the second regularising branch $f_{\xi}$, respectively. The embeddings are fed to a multilayer perceptron (MLP) $g_{\theta}$ to produce the projections $z$ and $\hat{z}$ to compute the cross-view consistency loss $\mathcal{L}_{\text{CV}}$. The self-training loss $\mathcal{L}_{\text{ST}}$ learns space-time embeddings between the anchors $q$ and pseudo labels $p$ (arg max of $u$, the affinities of $\hat{z}$ w.r.t. the anchors). The two branches are identical in architecture with shared weights.
Figure 3: Schematic of the serial block in the CoaT Transformer [xu2021coat]. Input feature maps are first down-sampled by a patch embedding layer, and the reduced feature maps are then flattened into a sequence of image tokens. Multiple Conv-Attention and Feed-Forward layers process the tokenized features, along with a class token (a vector used for image classification).
Figure 4: Representation Learning as similarity across views by discriminating features (i) spatially within individual frames and (ii) temporally, to represent each frame in a video sequence in terms of the same feature set.
Figure 5: Qualitative comparison between our model and a baseline model [Araslanov2021] applied to the YouTube-VOS [Xu2018b] (rows 1 and 4) and Seagrass [Ditria2021a] (rows 2 and 3) datasets. The representation learned by our model effectively separates objects from ambiguous backgrounds and is robust to occlusions.
...and 2 more figures
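
The pseudo-labelling step described in the Figure 2 caption (pseudo labels $p$ as the arg max of $u$, the affinities of $\hat{z}$ with respect to the anchors $q$) can be sketched in NumPy as follows. The caption does not specify how the affinities are computed, so the cosine similarity, the softmax normalisation, the `temperature` parameter, and the function name `pseudo_labels` are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def pseudo_labels(z_hat, anchors, temperature=0.1):
    """Assign each projection to its most similar anchor.

    z_hat:   (N, D) projections from the regularising branch.
    anchors: (K, D) anchor embeddings q.
    Returns the (N, K) affinities u and the (N,) pseudo labels p.
    """
    # Assumption: affinities are a softmax over cosine similarities.
    z = z_hat / np.linalg.norm(z_hat, axis=1, keepdims=True)
    q = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    logits = (z @ q.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    u = np.exp(logits)
    u = u / u.sum(axis=1, keepdims=True)
    # As in the caption: pseudo labels are the arg max of the affinities.
    p = u.argmax(axis=1)
    return u, p


# Toy data: 4 orthogonal anchors in 8 dimensions, and 3 projections that
# are slightly perturbed copies of anchors 2, 0 and 3.
rng = np.random.default_rng(0)
anchors = np.eye(4, 8)
z_hat = anchors[[2, 0, 3]] + 0.01 * rng.normal(size=(3, 8))

u, p = pseudo_labels(z_hat, anchors)
```

In this toy setup each projection is assigned to the anchor it was perturbed from. A lower temperature makes the affinities sharper without changing the arg max.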
