Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection

Melanie Wille, Tobias Fischer, Scarlett Raine

TL;DR

This work investigates why certain marine species are detected more reliably than others in underwater imagery by decomposing object detection into localization and classification. It systematically manipulates the DUO and RUOD-4C datasets to separate effects of data quantity from intrinsic visual features, and uses YOLO11n along with the TIDE failure-analysis toolkit to diagnose errors. The analysis reveals that foreground-background discrimination is the main bottleneck in localization, while intrinsic feature-based challenges and inter-class dependencies drive persistent classification gaps even under balanced data. Practically, the study recommends distribution-aware training (imbalanced for high precision, balanced for high recall) and emphasizes targeted localization improvements, with open-source code and datasets to enable reproducibility and further research.

Abstract

Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO and RUOD datasets to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.
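To make the "balanced data" condition concrete, the following is a minimal sketch of class-balanced down-sampling, in which every class is reduced to the instance count of the rarest class. It is an illustration under stated assumptions, not the authors' released code: the `(image_id, class_name)` annotation format, the function name `balance_by_downsampling`, and the toy counts are all hypothetical, and a real detection dataset would additionally need to handle images that contain several classes.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(annotations, seed=0):
    """Randomly keep the same number of instances for every class.

    `annotations` is a list of (image_id, class_name) pairs. This format is
    a hypothetical simplification, not the DUO/RUOD annotation schema.
    """
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann[1]].append(ann)

    # The rarest class sets the target count for all classes.
    target = min(len(items) for items in by_class.values())

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for name in sorted(by_class):
        balanced.extend(rng.sample(by_class[name], target))
    return balanced


# Toy, made-up instance counts (not the real dataset statistics).
toy = (
    [(i, "echinus") for i in range(50)]
    + [(i, "starfish") for i in range(20)]
    + [(i, "holothurian") for i in range(10)]
    + [(i, "scallop") for i in range(5)]
)
balanced = balance_by_downsampling(toy)
counts = Counter(cls for _, cls in balanced)
print(counts)
```

In this toy example the rarest class has five instances, so every class is reduced to five. Comparing a detector trained on such a subset against one trained on the full set is one way to separate the effect of data quantity from intrinsic visual difficulty.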

Paper Structure

This paper contains 18 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example underwater image from the DUO dataset [liu2021dataset], showing the four target classes (top row): Holothurian, Starfish, Scallop and Echinus. Left middle: the original class distribution of DUO (lighter bars, top) and the down-sampled, class-balanced distribution (solid bars, below). Left bottom: localization performance (mAP@0.5) on single-class versions of DUO. The lighter top bars indicate the localization performance for each class under the original distribution, and the solid bars below demonstrate that a clear performance difference between the classes remains even when each class has an equal number of training instances.
  • Figure 2: Visualization of TIDE error types on an example DUO image. Ground truth is indicated in grey, predicted bounding boxes in color. The error types are: "Missed GT" (completely undetected ground truth, causes FN), "Classification - Cls" (correct box location but wrong class, causes FP & FN), "Classification and Localization" (wrong class and insufficient box alignment, causes FP & FN), "Background - Bkg" (background predicted as object, causes FP), "Duplicate - Dupe" (duplicate prediction for already matched ground truth, causes FP), and "Localization - Loc" (correct class but insufficient IoU overlap with ground truth, causes FP & FN). Classification, localization, and combined errors are formally counted as false positives in TIDE but also represent false negatives since they leave ground truths unmatched.
  • Figure 3: Per-class localization performance compared across datasets. Bar charts indicate mAP@0.5 for the original (imbalanced) and balanced data; line charts represent gradually reduced sets. For both DUO (a) and RUOD-4C (b), scallop performance is the same in the original and balanced sets (all classes in the balanced set are downsampled to the scallop instance count), but still lower than that of all other classes, despite the same number of instances. Further reductions of the balanced set affect classes differently: starfish and echinus are impacted less by the decrease, whereas holothurian and scallop exhibit a significant reduction in mAP.
  • Figure 4: Distribution of TIDE error types across all classes for the original (red) and balanced (green) DUO datasets. The imbalanced versions exhibit higher rates of FPs, and specifically background errors, for the starfish, echinus and holothurian classes. In contrast, balanced datasets make the error profiles for these classes more similar to that of the scallop class, i.e. there are more missed ground truth detections. Training only on object-containing images, as seen in the Scallop Limited bars (blue), reduces misses but massively increases background errors.
  • Figure 5: Accuracy of predicted bounding boxes for DUO. (a) Bounding box center offset, with minor deviations but overall similar distribution across classes and datasets. (b) The bounding box area error, which highlights larger fluctuations for the small scallop objects but a similar trend overall.
  • ...and 2 more figures
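The error taxonomy described in the Figure 2 caption can be sketched as a small decision rule over IoU values. This is a simplified illustration, not the TIDE toolkit's implementation or API: the function names `iou` and `error_type`, the `(box, class)` ground-truth format, and the order of the checks are assumptions made for this example. The thresholds follow TIDE's commonly cited defaults, a foreground IoU of 0.5 and a background IoU of 0.1. "Missed GT" is not a property of a single prediction (it applies to ground truths left unmatched after all predictions are processed), so it is omitted here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def error_type(pred_box, pred_cls, gts, matched, t_f=0.5, t_b=0.1):
    """Label one prediction with a TIDE-style outcome (simplified).

    gts:     list of (box, class_name) ground truths (hypothetical format)
    matched: set of ground-truth indices already claimed by a
             higher-scoring prediction
    """
    same = [(iou(pred_box, b), i) for i, (b, c) in enumerate(gts) if c == pred_cls]
    other = [(iou(pred_box, b), i) for i, (b, c) in enumerate(gts) if c != pred_cls]
    best_same, idx_same = max(same, default=(0.0, None))
    best_other, _ = max(other, default=(0.0, None))

    if best_same >= t_f:
        # Well-aligned with a ground truth of the predicted class.
        return "Dupe" if idx_same in matched else "TP"
    if best_other >= t_f:
        return "Cls"   # correct location, wrong class
    if best_same >= t_b:
        return "Loc"   # correct class, insufficient overlap
    if best_other >= t_b:
        return "Both"  # wrong class and insufficient overlap
    return "Bkg"       # background predicted as object


gts = [((0, 0, 10, 10), "scallop"), ((20, 20, 30, 30), "starfish")]
print(error_type((0, 0, 10, 10), "scallop", gts, set()))     # TP
print(error_type((0, 0, 10, 10), "scallop", gts, {0}))       # Dupe
print(error_type((0, 0, 10, 10), "starfish", gts, set()))    # Cls
print(error_type((0, 0, 10, 4), "scallop", gts, set()))      # Loc
print(error_type((0, 0, 10, 4), "starfish", gts, set()))     # Both
print(error_type((50, 50, 60, 60), "scallop", gts, set()))   # Bkg
```

Counting these labels per class gives an error profile of the kind compared across the original and balanced datasets in Figure 4.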