Human-in-the-Loop Segmentation of Multi-species Coral Imagery
Scarlett Raine, Ross Marchant, Brano Kusy, Frederic Maire, Niko Suenderhauf, Tobias Fischer
TL;DR
This work tackles the bottleneck of annotating multi-species coral imagery for semantic segmentation with a two-stage pipeline. Stage One propagates very sparse point labels into augmented ground truth masks by applying a $k=1$ nearest-neighbor rule to $768$-dimensional embeddings from the denoised DINOv2 backbone; Stage Two trains segmentation models on these masks. A novel human-in-the-loop labeling strategy selects informative labeling points by combining feature-based uncertainty with spatial exploration, leading to large gains at budgets of 5–25 points. The method achieves up to $+19.7\%$ mIoU over the prior state-of-the-art for Stage One and $+13.5\%$ mIoU (and $+8.8\%$ PA) for end-to-end segmentation at 5 points, while reducing annotation costs and enabling rapid adaptation to new coral environments. These results demonstrate the practical potential of general foundation-model representations for domain-specific ecological monitoring, especially when expert labeling is expensive or scarce.
Abstract
Marine surveys by robotic underwater and surface vehicles produce substantial quantities of coral reef imagery; however, labeling these images is expensive and time-consuming for domain experts. Point label propagation is a technique that uses existing images labeled with sparse points to create augmented ground truth data, which can then be used to train a semantic segmentation model. In this work, we show that recent advances in large foundation models facilitate the creation of augmented ground truth masks using only features extracted by the denoised version of the DINOv2 foundation model and K-Nearest Neighbors (KNN), without any pre-training. For images with extremely sparse labels, we use human-in-the-loop principles to enhance annotation efficiency: with 5 point labels per image, our method outperforms the prior state-of-the-art by 19.7% in mIoU. When human-in-the-loop labeling is not available, using the denoised DINOv2 features with KNN still improves on the prior state-of-the-art by 5.8% in mIoU (5 grid points). On the semantic segmentation task, we outperform the prior state-of-the-art by 13.5% in mIoU when only 5 point labels are used for point label propagation. Additionally, we perform a comprehensive study of the number and placement of point labels, and make several recommendations for improving the efficiency of labeling images with points.
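The propagation step described above can be illustrated with a minimal sketch: every location in a feature grid takes the class of its single most similar labeled point. This is not the authors' implementation. The function name `propagate_point_labels`, the use of cosine similarity as the distance measure, and the (H, W, D) feature-grid layout are all assumptions made for illustration; in practice the features would come from the denoised DINOv2 backbone rather than from an arbitrary array.

```python
import numpy as np


def propagate_point_labels(features, point_coords, point_labels):
    """Assign each grid location the label of its nearest labeled point (k=1).

    features:     (H, W, D) array of per-location embeddings (assumed layout).
    point_coords: (N, 2) integer array of (row, col) positions of labeled points.
    point_labels: (N,) integer array of class ids for those points.
    Returns an (H, W) integer mask of propagated class ids.
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(np.float64)

    # L2-normalize so the dot product equals cosine similarity
    # (the similarity measure is an assumption of this sketch).
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)

    # Look up the embeddings at the labeled point locations.
    rows, cols = point_coords[:, 0], point_coords[:, 1]
    labeled = flat[rows * w + cols]            # (N, D)

    # Similarity of every location to every labeled point, then k=1 vote.
    similarity = flat @ labeled.T              # (H*W, N)
    nearest = similarity.argmax(axis=1)        # index of the closest labeled point
    return point_labels[nearest].reshape(h, w)
```

With 5 labeled points per image, `point_coords` would have shape (5, 2), and the returned mask would serve as the augmented ground truth for training a segmentation model.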
