Label-efficient underwater species classification with semi-supervised learning on frozen foundation model embeddings

Thomas Manuel Rost

Abstract

Automated species classification from underwater imagery is bottlenecked by the cost of expert annotation, and supervised models trained on one dataset rarely transfer to new conditions. We investigate whether semi-supervised methods operating on frozen foundation model embeddings can close this annotation gap with minimal labeling effort. Using DINOv3 ViT-B embeddings with no fine-tuning, we propagate a small set of labeled seeds through unlabeled data via nearest-neighbor-based self-training and evaluate on the AQUA20 benchmark (20 marine species). With fewer than 5% of the training labels, self-training on frozen embeddings closes much of the gap to a fully supervised ConvNeXt baseline trained on the entire labeled dataset; at full supervision, the gap narrows to a few percentage points, with several species exceeding the supervised baseline. Class separability in the embedding space, measured by ROC-AUC, is high even at extreme label scarcity, indicating that the frozen representations capture discriminative structure well before decision boundaries can be reliably estimated. Our approach requires no training, no domain-specific data engineering, and no underwater-adapted models, establishing a practical, immediately deployable baseline for label-efficient marine species recognition. All results are reported on the held-out test set over 100 random seed initializations.

Paper Structure

This paper contains 20 sections, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: Sample images from each of the 20 AQUA20 species categories, illustrating the diversity of morphology, colouration, and imaging conditions across the dataset.
  • Figure 2: Sample images from each of the 20 AQUA20 species categories from the official train (left) and test (right) splits, illustrating the diversity of morphology, colouration, and imaging conditions across the dataset.
  • Figure 3: Overview of the experimental pipeline. Frozen DINOv3 embeddings are extracted for all images, reduced via PCA to 128 dimensions, and classified using either Self-Train KNN or a KNN Baseline under varying label budgets.
  • Figure 4: t-SNE projection of frozen DINOv3 test-set embeddings (PCA-128). Left: ground-truth species labels. Right: predictions from the best method at full supervision. Species form well-separated clusters in the frozen embedding space; the predicted labels closely reproduce the true structure, confirming that the DINOv3 representation captures the discriminative information exploited by end-to-end supervised models.
  • Figure 5: t-SNE projection of the test set across label budgets. The leftmost panel shows ground-truth species (for spatial reference); subsequent panels show correct predictions (gray) and misclassifications (red) at 1, 5, 15 labels per class and full supervision. Errors concentrate at cluster boundaries and between visually similar species, diminishing as the label budget grows.
  • ...and 6 more figures
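The pipeline summarized in Figure 3 (frozen embeddings, PCA to 128 dimensions, then KNN-based self-training under a small label budget) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: synthetic clusters stand in for the frozen DINOv3 embeddings, scikit-learn's `SelfTrainingClassifier` stands in for the self-training loop, and all variable names and parameter values (5 seeds per class, confidence threshold 0.8) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)

# Stand-in for frozen foundation-model embeddings: 20 well-separated
# "species" clusters in a 256-dimensional feature space.
X, y = make_blobs(n_samples=2000, centers=20, n_features=256,
                  cluster_std=2.0, random_state=0)

# Reduce the embeddings to 128 dimensions, as in the described pipeline.
X = PCA(n_components=128, random_state=0).fit_transform(X)

# Simulate a small label budget: keep 5 labeled seeds per class and
# mark every other sample as unlabeled (-1, scikit-learn's convention).
y_partial = np.full_like(y, -1)
for c in np.unique(y):
    idx = np.where(y == c)[0]
    y_partial[rng.choice(idx, size=5, replace=False)] = c

# Self-training: a KNN classifier iteratively pseudo-labels unlabeled
# points whose predicted class probability exceeds the threshold, then
# refits on the enlarged labeled set.
base = KNeighborsClassifier(n_neighbors=5)
model = SelfTrainingClassifier(base, threshold=0.8)
model.fit(X, y_partial)

acc = accuracy_score(y, model.predict(X))
print(f"accuracy with 5 labels/class: {acc:.3f}")
```

Because the synthetic clusters are cleanly separable, a handful of seeds per class already recovers most labels; the paper's point is that frozen DINOv3 embeddings of real underwater imagery exhibit a similarly separable structure.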