Human-in-the-Loop Segmentation of Multi-species Coral Imagery

Scarlett Raine, Ross Marchant, Brano Kusy, Frederic Maire, Niko Suenderhauf, Tobias Fischer

TL;DR

This work tackles the bottleneck of annotating multi-species coral imagery for semantic segmentation by proposing a two-stage pipeline that first propagates very sparse point labels using $768$-dimensional embeddings from the denoised DINOv2 backbone with a $k=1$ nearest-neighbor rule to generate augmented ground truth, and then trains segmentation models on these masks. A novel human-in-the-loop labeling strategy guides the selection of informative labeling points by combining feature-based uncertainty with spatial exploration, leading to large gains at 5–25 point budgets. The method achieves up to $+19.7\%$ mIoU improvement over prior state-of-the-art for Stage One and $+13.5\%$ mIoU (and $+8.8\%$ PA) for end-to-end segmentation at 5 points, while reducing annotation costs and enabling rapid adaptation to new coral environments. These results demonstrate the practical potential of leveraging general foundation-model representations in domain-specific ecological monitoring, especially when expert labeling is expensive or scarce.
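
The Stage One propagation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-pixel feature array is assumed to be already extracted (for example, upsampled denoised DINOv2 embeddings), cosine similarity is assumed as the nearest-neighbor metric, and the function name `propagate_point_labels` is hypothetical.

```python
import numpy as np

def propagate_point_labels(features, point_coords, point_labels):
    """Assign every pixel the label of its nearest labeled point (k = 1).

    features:     (H, W, D) array of per-pixel embeddings (assumed precomputed).
    point_coords: (N, 2) integer array of (row, col) positions of labeled points.
    point_labels: (N,) integer array of class ids for those points.
    Returns an (H, W) integer mask (the augmented ground truth).
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(float)
    # L2-normalise so that the dot product equals cosine similarity
    # (assumption: the source does not state the metric used by its KNN).
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)

    point_coords = np.asarray(point_coords)
    point_labels = np.asarray(point_labels)
    rows, cols = point_coords[:, 0], point_coords[:, 1]
    anchors = flat[rows * w + cols]          # (N, D) features at labeled points

    sims = flat @ anchors.T                  # (H*W, N) cosine similarities
    nearest = sims.argmax(axis=1)            # index of the closest labeled point
    return point_labels[nearest].reshape(h, w)
```

On unit-normalised features, ranking by cosine similarity is equivalent to ranking by Euclidean distance, so the choice between the two does not change the 1-nearest-neighbor result in this sketch.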

Abstract

Marine surveys by robotic underwater and surface vehicles produce substantial quantities of coral reef imagery; however, labeling these images is expensive and time-consuming for domain experts. Point label propagation is a technique that uses existing images labeled with sparse points to create augmented ground truth data, which can be used to train a semantic segmentation model. In this work, we show that recent advances in large foundation models facilitate the creation of augmented ground truth masks using only features extracted by the denoised version of the DINOv2 foundation model and K-Nearest Neighbors (KNN), without any pre-training. For images with extremely sparse labels, we use human-in-the-loop principles to enhance annotation efficiency: with 5 point labels per image, our method outperforms the prior state-of-the-art by 19.7% for mIoU. When human-in-the-loop labeling is not available, using the denoised DINOv2 features with KNN still improves on the prior state-of-the-art by 5.8% for mIoU (5 grid points). On the semantic segmentation task, we outperform the prior state-of-the-art by 13.5% for mIoU when only 5 point labels are used for point label propagation. Additionally, we perform a comprehensive study of the number and placement of point labels, and make several recommendations for improving the efficiency of labeling images with points.
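
The results above are reported as mean intersection over union (mIoU) and pixel accuracy (PA). For readers unfamiliar with these metrics, here is a minimal sketch of their standard definitions for a single predicted mask. The function names are hypothetical, and the source does not state how the authors average across images or handle classes absent from an image, so skipping absent classes here is an assumption.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the ground truth."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float((pred == target).mean())

def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU = |pred AND target| / |pred OR target|.

    Classes that appear in neither mask are skipped (assumption).
    """
    pred, target = np.asarray(pred), np.asarray(target)
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class c is absent from both masks
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
```

For example, with `pred = [[0, 0], [1, 1]]` and `target = [[0, 1], [1, 1]]`, three of four pixels match (PA = 0.75), class 0 has IoU 1/2 and class 1 has IoU 2/3, giving mIoU = 7/12.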

Paper Structure

This paper contains 33 sections, 5 equations, 14 figures, and 4 tables.

Figures (14)

  • Figure 1: The proposed point label propagation technique utilizes the DINOv2 foundation model without any fine-tuning to create augmented ground truth masks for intricate coral images. Top left: Previous methods depended on layering superpixels that contained point labels [alonso2018semantic, alonso2019coralseg, pierce2020reducing]. Top right: A more recent approach involved pre-training a CNN feature extractor on densely labeled coral images and propagating point labels with a custom, point label aware superpixel function [raine2022point]. Bottom: In contrast, our approach employs KNN to group features derived from the denoised version of the DINOv2 foundation model without further training on coral imagery.
  • Figure 2: Instance segmentation by Segment Anything 2 [ravi2024sam] on UCSD Mosaics [edwards2017large, alonso2019coralseg] coral images. This figure shows that Segment Anything 2 does not produce usable segments in these images, and also does not provide any class predictions for the segments. Refer to Fig. \ref{fig:legend} for the color legend for the ground truth masks.
  • Figure 3: A comparison of segmentation by DINOv2 [oquab2024dinov] on UCSD Mosaics [edwards2017large, alonso2019coralseg] coral images (a, b, and c) and an image from the Cityscapes [cordts2016cityscapes] dataset (d). Note that in example (c), the DINOv2 model does not produce any segments. This demonstrates that DINOv2 cannot be directly used in domain-specific applications. This work instead proposes leveraging the deep feature space in combination with a nearest neighbor classifier to perform point label propagation.
  • Figure 4: Schematic of Proposed Human-in-the-Loop Labeling Approach. We combine domain expert knowledge with the model's internal uncertainty to improve point label selection. The process starts with inputting a coral image and having a marine scientist label up to 10 points centrally located on the largest instances (refer to Section \ref{subsec:ablations} for an analysis of this value). A feature similarity map is then generated by computing cosine similarities between the labeled points and all other pixels. To promote exploration, we use a distance map in conjunction with the similarity map to create a combined probability mask for pixel selection. The chosen pixel is then sent back to the marine scientist for labeling, and the KNN is updated accordingly. Once the maximum number of points has been labeled, an augmented ground truth mask is created for use in training a semantic segmentation model. The semantic segmentation architecture is trained end-to-end on pairs of images and augmented ground truth masks, and the trained model is then used to perform inference on new images (see Fig. \ref{fig:segpipeline}).
  • Figure 5: Schematic of Full Semantic Segmentation Framework. We combine the human-in-the-loop point label propagation approach (Stage One) with a semantic segmentation architecture (Stage Two). The semantic segmentation architecture is trained on the augmented ground truth masks to enable inference on unlabeled imagery. Refer to Section \ref{subsec:ablations} for an analysis of various semantic segmentation architectures.
  • ...and 9 more figures
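
The point-selection loop described in the Figure 4 caption can be sketched roughly as below. This is an illustration only: the caption does not give the exact formula for combining the similarity map and the distance map, so the weighted sum, the weight `lam`, the rescaling of each map to [0, 1], and the function name `select_next_point` are all assumptions rather than the authors' method.

```python
import numpy as np

def select_next_point(features, labeled_coords, lam=0.5, rng=None):
    """Propose the next pixel to send to the annotator.

    features:       (H, W, D) per-pixel embeddings (assumed precomputed).
    labeled_coords: (N, 2) integer (row, col) positions already labeled.
    lam:            hypothetical weight between uncertainty and exploration.
    rng:            optional np.random.Generator; if None, take the argmax.
    Returns ((row, col), prob) where prob is the (H, W) probability mask.
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(float)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    labeled_coords = np.asarray(labeled_coords)
    rows, cols = labeled_coords[:, 0], labeled_coords[:, 1]
    anchors = flat[rows * w + cols]

    # Similarity map: highest cosine similarity to any labeled point.
    sim = (flat @ anchors.T).max(axis=1)
    uncertainty = (1.0 - sim) / 2.0          # low similarity -> high uncertainty

    # Distance map: distance to the closest labeled point, scaled to [0, 1].
    rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([rr.ravel(), cc.ravel()], axis=1).astype(float)
    diffs = pix[:, None, :] - labeled_coords[None, :, :]
    dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    dist = dist / (dist.max() + 1e-8)

    # Combined score (assumed form), never re-selecting labeled pixels.
    score = lam * uncertainty + (1.0 - lam) * dist
    score[rows * w + cols] = 0.0
    total = score.sum()
    if total == 0:
        raise ValueError("no unlabeled pixel with a non-zero score")
    prob = score / total

    if rng is None:
        idx = int(prob.argmax())
    else:
        idx = int(rng.choice(h * w, p=prob))
    return divmod(idx, w), prob.reshape(h, w)
```

In a full loop, the returned pixel would be shown to the annotator, its label appended to the labeled set, and the function called again until the point budget is reached. Passing a `np.random.default_rng()` instance samples from the mask instead of always taking its maximum.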