An interpretable approach to automating the assessment of biofouling in video footage

Evelyn J. Mannix, Bartholomew A. Woodham

TL;DR

This work tackles the need for interpretable, scalable automated biofouling assessment from underwater imagery. It introduces ComFe, an interpretable-by-design approach built on a frozen DINOv2 Vision Transformer backbone, which identifies region-level component features and matches them to class prototypes to predict fouling with visual explanations. The approach outperforms prior CNN-based methods, supports summarizing ROV video through representative frames, and correlates predicted coverage with the SLoF severity scale, offering practical deployment guidance and data/code transparency. The results have direct implications for faster, more trustworthy biosecurity assessments and potential vessel-level fouling estimation in real-world regulatory contexts.

Abstract

Biofouling, communities of organisms that grow on hard surfaces immersed in water, provides a pathway for the spread of invasive marine species and diseases. To address this risk, international vessels are increasingly being obligated to provide evidence of their biofouling management practices. Verification that these activities are effective requires underwater inspections, using divers or underwater remotely operated vehicles (ROVs), and the collection and analysis of large amounts of imagery and footage. Automated assessment using computer vision techniques can significantly streamline this process, and this work shows how this challenge can be addressed efficiently and effectively using the interpretable Component Features (ComFe) approach with a DINOv2 Vision Transformer (ViT) foundation model. ComFe is able to obtain improved performance in comparison to previous non-interpretable Convolutional Neural Network (CNN) methods, with significantly fewer weights and greater transparency, through identifying which regions of the image contribute to the classification, and which images in the training data lead to that conclusion. All code, data and model weights are publicly released.

Paper Structure

This paper contains 20 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Illustration of ComFe. The image is first clustered into component features, which represent different regions with similar content. The green component feature shows the vessel hull, blue captures the sea chest grating, teal highlights the image tags, and the yellow and red component features identify the regions of the image with biofouling. These are then compared to class prototypes, which match as expected. The match is visualised by the exemplars: training images in the dataset with similar embeddings to the fitted class prototypes. This comparison between component features and class prototypes is used to identify the salient parts of the image for predicting whether biofouling is present, as shown by the class confidence heatmap in the final image, where the red regions show areas with higher confidence of fouling being present.
  • Figure 2: Example training (first row) and testing (second row) images from the biofouling dataset [mannix2021automating].
  • Figure 3: Example detections of biofouling versus not biofouling within an image using the DINOv2 ViT-B/14 w/reg network with a ComFe head. The red regions show areas with a high confidence of fouling being present, and the blue regions highlight areas of zero confidence.
  • Figure 4: Predicted coverage within an image of fouling and paint damage by ComFe versus label categories for fouling and paint damage severity.
  • Figure 5: Precision-recall curve for the best ComFe model using the DINOv2 ViT-B/14 (f) w/reg backbone for detecting the presence of biofouling.
  • ...and 9 more figures
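
The pipeline described in the Figure 1 caption (patches grouped into component features, component features compared to class prototypes, and the result shown as a class confidence heatmap) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the softmax-over-cosine-similarity matching, the temperature and the 0.5 coverage threshold are all assumptions made for illustration, and the embeddings are assumed to come from a frozen backbone such as DINOv2.

```python
import numpy as np


def l2_normalise(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def fouling_heatmap(patch_emb, component_emb, prototype_emb,
                    prototype_class, grid_hw, temperature=0.1):
    """Hypothetical sketch of prototype matching (not the ComFe code).

    patch_emb:       (P, D) patch embeddings from a frozen backbone
    component_emb:   (C, D) component features for this image
    prototype_emb:   (K, D) fitted class prototypes
    prototype_class: (K,) integer labels, assumed 0 = clean, 1 = fouling
    grid_hw:         (H, W) patch grid shape with H * W == P
    """
    patches = l2_normalise(patch_emb)
    components = l2_normalise(component_emb)
    prototypes = l2_normalise(prototype_emb)

    # Soft assignment of each patch to a component feature: (P, C)
    patch_to_comp = softmax(patches @ components.T / temperature, axis=1)
    # Soft match of each component feature to the class prototypes: (C, K)
    comp_to_proto = softmax(components @ prototypes.T / temperature, axis=1)

    # Sum prototype weights per class to get class confidences: (C, n_classes)
    n_classes = int(prototype_class.max()) + 1
    one_hot = np.eye(n_classes)[prototype_class]
    comp_class = comp_to_proto @ one_hot

    # Propagate component confidences back to patches: (P, n_classes)
    patch_class = patch_to_comp @ comp_class
    return patch_class[:, 1].reshape(grid_hw)


def coverage(heatmap, threshold=0.5):
    """Fraction of patches whose fouling confidence exceeds the threshold."""
    return float((heatmap > threshold).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    heat = fouling_heatmap(
        patch_emb=rng.normal(size=(16, 8)),
        component_emb=rng.normal(size=(5, 8)),
        prototype_emb=rng.normal(size=(4, 8)),
        prototype_class=np.array([0, 0, 1, 1]),
        grid_hw=(4, 4),
    )
    print(heat.shape, coverage(heat))
```

Thresholding the heatmap gives a per-image coverage value of the kind plotted against the severity labels in Figure 4, though the exact coverage definition used in the paper may differ from this sketch.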