HideAndSeg: an AI-based tool with automated prompting for octopus segmentation in natural habitats

Alan de Aguiar, Michaella Pereira Andrade, Charles Morphy D. Santos, João Paulo Gois

TL;DR

HideAndSeg tackles the challenge of segmenting octopuses in natural underwater videos by integrating SAM2 with a specialized YOLOv11 detector and introducing two unsupervised metrics, $DICE_t$ and $NC_t$, to guide mask quality without ground-truth labels. The method starts with minimal manual annotation to seed SAM2, trains YOLO on the resulting segmentation boxes, and then uses YOLO detections to automate SAM2 prompts for full video segmentation. Results show high temporal consistency ($DICE_t \approx 0.97$) and low fragmentation ($NC_t \approx 2.2$), with YOLO achieving strong detection metrics (mAP@50 ≈ 0.971, mAP@50–95 ≈ 0.872), indicating robust performance under camouflage and occlusion. The approach enables scalable, automated behavioral analysis of wild cephalopods and suggests a path toward applying similar unsupervised-guided prompts to other wildlife in challenging habitats.

Abstract

Analyzing octopuses in their natural habitats is challenging due to their camouflage capability, rapid changes in skin texture and color, non-rigid body deformations, and frequent occlusions, all of which are compounded by variable underwater lighting and turbidity. Addressing the lack of large-scale annotated datasets, this paper introduces HideAndSeg, a novel, minimally supervised AI-based tool for segmenting videos of octopuses, and establishes a quantitative baseline for this task. HideAndSeg integrates SAM2 with a custom-trained YOLOv11 object detector. First, the user provides point coordinates to generate the initial segmentation masks with SAM2. These masks serve as training data for the YOLO model. After that, our approach fully automates the pipeline by providing a bounding box prompt to SAM2, eliminating the need for further manual intervention. We introduce two unsupervised metrics, temporal consistency ($DICE_t$) and new component count ($NC_t$), to quantitatively evaluate segmentation quality and guide mask refinement in the absence of ground-truth data, i.e., real-world information that serves to train, validate, and test AI models. Results show that HideAndSeg achieves satisfactory performance, reducing segmentation noise compared to the manually prompted approach. Our method can re-identify and segment the octopus even after periods of complete occlusion in natural environments, a scenario in which the manually prompted model fails. By reducing the need for manual analysis in real-world scenarios, this work provides a practical tool that paves the way for more efficient behavioral studies of wild cephalopods.
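The exact definitions of $DICE_t$ and $NC_t$ are not reproduced in this summary, so the following is only a minimal sketch of how two such label-free metrics could be computed. It assumes that $DICE_t$ is the Dice overlap between the masks of consecutive frames and that $NC_t$ counts connected components of the current mask that share no pixel with the previous mask. Both readings, the function names, and the representation of a mask as a set of `(row, col)` pixels are assumptions made for illustration, not the authors' implementation.

```python
from collections import deque

Pixel = tuple[int, int]  # (row, col); hypothetical mask representation


def temporal_dice(prev_mask: set[Pixel], curr_mask: set[Pixel]) -> float:
    """Dice overlap between consecutive masks (assumed form of DICE_t)."""
    total = len(prev_mask) + len(curr_mask)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks count as identical
    return 2.0 * len(prev_mask & curr_mask) / total


def connected_components(mask: set[Pixel]) -> list[set[Pixel]]:
    """Split a mask into 4-connected components using breadth-first search."""
    unvisited = set(mask)
    components = []
    while unvisited:
        seed = unvisited.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for neighbour in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if neighbour in unvisited:
                    unvisited.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def new_component_count(prev_mask: set[Pixel], curr_mask: set[Pixel]) -> int:
    """Count components of the current mask that do not touch the previous
    mask at all (assumed reading of NC_t)."""
    return sum(
        1
        for component in connected_components(curr_mask)
        if component.isdisjoint(prev_mask)
    )


# Toy example: a 2x2 blob shifts one pixel to the right and a stray pixel appears.
prev = {(0, 0), (0, 1), (1, 0), (1, 1)}
curr = {(0, 1), (1, 1), (0, 2), (1, 2), (5, 5)}
dice_value = temporal_dice(prev, curr)
new_components = new_component_count(prev, curr)
```

Under these assumptions, a stable mask keeps the Dice value close to 1, while speckled or leaking masks raise the component count. In practice the masks would more likely be NumPy arrays; plain sets are used here only to keep the sketch dependency-free.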

Paper Structure

This paper contains 14 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: Failure cases for models when applied to octopus videos. (A) YOLO incorrectly labels the coral reef as a “giraffe” and fails to detect the octopus on the right side of the frame; (B) In SAM2, a fish crossing in front of the octopus causes the segmentation mask (a purple tint) to leak into the fish; (C) Also in SAM2, camouflage and water conditions cause the surrounding environment to leak into the segmentation mask.
  • Figure 2: HideAndSeg pipeline. The input video is first processed through frame extraction. The first clear frame is manually annotated to provide an initial prompt for SAM2, which then generates segmentation masks that are evaluated using our proposed unsupervised metrics. For additional manual annotation, one can select the frame with the lowest metric score. Once the process is complete, the resulting masks are used to train a YOLO-based object segmentation model that ultimately replaces manual prompt annotation for SAM2, resulting in a fully automated segmentation process.
  • Figure 3: Three consecutive frames illustrating how YOLO is used in conjunction with SAM2. (A) SAM2 initially fails to recognize the octopus, producing a noisy, speckled segmentation mask before eventually generating a coherent result; (B) The specialized YOLO model successfully detects the octopus from the very first frame; (C) When the YOLO detections are used as prompts for SAM2, accurate segmentation masks are produced from the beginning of the sequence. Thus, we infer that the target object was not abruptly lost during processing; any degradation likely occurred gradually or along the segmentation boundaries.
  • Figure 4: YOLO successfully detects the octopus on the right side of the image after being trained on a specialized dataset.
  • Figure 5: Variation in the $NC_t$ metric across a test video using both methods of the proposed pipeline. Initial segmentation: (A) Initially, the model produces a coherent mask, resulting in a low $NC_t$ value; (B) In the central section of the video, the octopus hides behind rocks. This behavior degrades mask quality as parts of the surrounding environment begin to be erroneously included; (C) Although the octopus becomes visible again, the noise introduced earlier prevents successful recognition, leading to an empty mask. Fully automated segmentation: (D) Initially, the method produces a coherent mask with low $NC_t$; (E) When the octopus hides, YOLO fails to detect it, resulting in an empty mask; (F) Once the octopus reappears, YOLO successfully detects it and prompts SAM2 to generate accurate masks again.
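
The pipeline in Figure 2 turns SAM2 masks into training data for the YOLO detector. One step this implies is converting each mask into a bounding box annotation. The sketch below shows one possible way to do that, assuming the widely used YOLO text label format (`class x_center y_center width height`, all normalized to the image size). The helper names, the set-of-pixels mask representation, and the single class id `0` are hypothetical choices for illustration; the paper's actual conversion code is not shown in this summary.

```python
from typing import Optional

Pixel = tuple[int, int]  # (row, col); hypothetical mask representation


def mask_to_box(mask: set[Pixel]) -> Optional[tuple[int, int, int, int]]:
    """Return the tight pixel box (x_min, y_min, x_max, y_max) around a mask,
    or None when the mask is empty (e.g. the octopus is fully occluded)."""
    if not mask:
        return None
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    # +1 so the box covers the far edge of the last pixel, not its near corner.
    return min(cols), min(rows), max(cols) + 1, max(rows) + 1


def box_to_yolo_label(
    box: tuple[int, int, int, int],
    img_width: int,
    img_height: int,
    class_id: int = 0,  # assumption: a single "octopus" class
) -> str:
    """Format a pixel box as a normalized YOLO detection label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / img_width
    y_center = (y_min + y_max) / 2 / img_height
    width = (x_max - x_min) / img_width
    height = (y_max - y_min) / img_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Toy example: the four corners of a 20x20 region in a 100x50 (width x height) image.
example_mask = {(10, 20), (10, 39), (29, 20), (29, 39)}
example_box = mask_to_box(example_mask)
example_label = box_to_yolo_label(example_box, img_width=100, img_height=50)
```

Frames with an empty mask yield `None` and can simply be skipped when building the training set, which mirrors the occlusion case in panel (E) of Figure 5 where no detection is available.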