Overcoming Annotation Bottlenecks in Underwater Fish Segmentation: A Robust Self-Supervised Learning Approach

Alzayat Saleh, Marcus Sheaves, Dean Jerry, Mostafa Rahimi Azghadi

TL;DR

This work tackles the annotation bottleneck in underwater fish segmentation by introducing a self-supervised, Transformer-based framework that learns robust representations from unlabeled video. The method employs a two-branch architecture with cross-view consistency and space-time self-training, supported by anchor sampling and label propagation, enabling effective segmentation across challenging underwater datasets. Trained on DeepFish and evaluated on Seagrass and YouTube-VOS, it outperforms existing self-supervised baselines and approaches fully supervised performance without annotations, while offering computational efficiency for potential edge deployment. Limitations include occlusion handling under seagrass and generalization to more diverse habitats; future work could integrate additional modalities and extend to species-level identification and other underwater objects.

Abstract

Accurate fish segmentation in underwater videos is challenging due to low visibility, variable lighting, and dynamic backgrounds, making fully supervised methods that require manual annotation impractical for many applications. This paper introduces a novel self-supervised deep learning approach for fish segmentation. Our model, trained without manual annotation, learns robust and generalizable representations by aligning features across augmented views and enforcing spatio-temporal consistency. We demonstrate its effectiveness on three challenging video datasets: DeepFish, Seagrass, and YouTube-VOS, surpassing existing self-supervised methods and achieving segmentation accuracy comparable to fully supervised methods without the need for costly annotations. Trained on DeepFish, our model exhibits strong generalization, achieving high segmentation accuracy on the unseen Seagrass and YouTube-VOS datasets. Furthermore, our model is computationally efficient due to its parallel processing and efficient anchor sampling technique, making it suitable for real-time applications and potential deployment on edge devices. We present quantitative results using the Jaccard Index and the Dice coefficient, as well as qualitative comparisons, showcasing the accuracy, robustness, and efficiency of our approach for advancing underwater video analysis.
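
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how the Jaccard Index (intersection over union) and the Dice coefficient are commonly computed for a pair of binary masks. It is not the authors' evaluation code: the function names, the use of NumPy, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np


def jaccard_index(pred, target):
    """Intersection over union of two binary masks (hypothetical helper)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


def dice_coefficient(pred, target):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) (hypothetical helper)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / total)


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

j = jaccard_index(pred_mask, true_mask)     # 2 / 4
d = dice_coefficient(pred_mask, true_mask)  # 4 / 6
print(f"Jaccard = {j:.3f}, Dice = {d:.3f}")
```

The two scores are monotonically related by Dice = 2J / (1 + J), so they rank methods identically on a single mask pair; they can diverge once scores are averaged over many frames.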

Paper Structure (26 sections, 5 equations, 7 figures, 3 tables, 3 algorithms)

This paper contains 26 sections, 5 equations, 7 figures, 3 tables, 3 algorithms.

Figures (7)

  • Figure 1: The natural visual artefact dynamics provide important cues about the composition of scenes and how they change.
  • Figure 2: Our proposed framework consists of a single feature extractor that processes video sequences. Given a batch of unlabeled video sequences $x$, two batches of different views $v$ and $\hat{v}$ are produced and then encoded into embeddings $y$ and $\hat{y}$ through the main branch $f_{\theta}$ and the second, regularising branch $f_{\xi}$, respectively. The embeddings are fed to a multilayer perceptron (MLP) $g_{\theta}$ to produce the projections $z$ and $\hat{z}$, which are used to compute the cross-view consistency loss $\mathcal{L}_{\text{CV}}$. The self-training loss $\mathcal{L}_{\text{ST}}$ learns space-time embeddings between the anchors $q$ and the pseudo labels $p$ (the arg max of $u$, the affinities of $\hat{z}$ w.r.t. the anchors). The two branches are identical in architecture and share weights.
  • Figure 3: Schematic of the serial block in the CoaT Transformer [xu2021coat]. Input feature maps are first down-sampled by a patch embedding layer, and the reduced feature maps are then flattened into a sequence of image tokens. Multiple Conv-Attention and Feed-Forward layers process the tokenized features, along with a class token (a vector used for image classification).
  • Figure 4: Representation learning as similarity across views: features are discriminated (i) spatially within individual frames and (ii) temporally, so that each frame in a video sequence is represented in terms of the same feature set.
  • Figure 5: Qualitative comparison between our model and a baseline model [Araslanov2021] on the YouTube-VOS [Xu2018b] (rows 1 and 4) and Seagrass [Ditria2021a] (rows 2 and 3) datasets. The representation learned by our model effectively distinguishes objects from ambiguous backgrounds and is robust to occlusions.
  • ...and 2 more figures
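
The pseudo-labelling step described in the Figure 2 caption (pseudo labels $p$ taken as the arg max of $u$, the affinities of $\hat{z}$ w.r.t. the anchors $q$) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the caption does not say how the affinities are computed, so the cosine similarity, the softmax, the `temperature` value, and all function names here are assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (hypothetical helper)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def pseudo_labels(z_hat, anchors, temperature=0.1):
    """Return affinities u of shape (N, K) and pseudo labels p of shape (N,).

    z_hat:   (N, D) projections from the regularising branch.
    anchors: (K, D) anchor embeddings q.
    Assumption: affinity = softmax over anchors of cosine similarity
    divided by a temperature; the caption only states that p = arg max u.
    """
    z_hat = np.asarray(z_hat, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    logits = l2_normalize(z_hat) @ l2_normalize(anchors).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    u = np.exp(logits)
    u = u / u.sum(axis=1, keepdims=True)
    p = u.argmax(axis=1)
    return u, p


# Toy example: two anchors in a 2-D embedding space, three projections.
q = [[1.0, 0.0],
     [0.0, 1.0]]
z_hat = [[0.9, 0.1],
         [0.2, 0.8],
         [2.0, 0.0]]

u, p = pseudo_labels(z_hat, q)
print(p)
```

Each projection is assigned to the anchor it is most similar to; in a self-training setup of this kind, those assignments would then serve as targets for the main branch.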