CaveSeg: Deep Semantic Segmentation and Scene Parsing for Autonomous Underwater Cave Exploration

A. Abdullah; T. Barua; R. Tibbetts; Z. Chen; M. J. Islam; I. Rekleitis

TL;DR

CaveSeg addresses the lack of annotated underwater cave data for semantic perception and real-time navigation by introducing the CaveSeg dataset and a lightweight transformer-based segmentation model. The dataset spans three major cave systems and includes 13 navigation- and safety-relevant object categories, enabling dense scene parsing and practical planning. Empirical results show competitive accuracy with a significantly smaller memory footprint and faster inference than the baselines, and the authors demonstrate use cases in safe navigation, diver coordination, and 3D semantic mapping. The work lays a foundation for vision-based autonomous exploration and mapping of underwater caves, with future directions including tighter geometry-semantics fusion and expanded label sets.

Abstract

In this paper, we present CaveSeg, the first visual learning pipeline for semantic segmentation and scene parsing for AUV navigation inside underwater caves. We address the problem of scarce annotated training data by preparing a comprehensive dataset for semantic segmentation of underwater cave scenes. It contains pixel annotations for important navigation markers (e.g., caveline, arrows), obstacles (e.g., ground plane and overhead layers), scuba divers, and open areas for servoing. Through comprehensive benchmark analyses on cave systems in the USA, Mexico, and Spain, we demonstrate that robust deep visual models can be developed based on CaveSeg for fast semantic scene parsing of underwater cave environments. In particular, we formulate a novel transformer-based model that is computationally light and offers near real-time execution in addition to achieving state-of-the-art performance. Finally, we explore the design choices and implications of semantic segmentation for visual servoing by AUVs inside underwater caves. The proposed model and benchmark dataset open up promising opportunities for future research in autonomous underwater cave exploration and mapping.

Paper Structure (19 sections, 1 equation, 9 figures, 2 tables)

Introduction and Background
Related Work
Underwater Cave Exploration and Mapping
Semantic Segmentation of Underwater Scenes
CaveSeg Dataset: Data Preparation and Problem Formulation
CaveSeg Model: Semantic Scene Segmentation Of Underwater Caves
Network Design and Learning Pipeline
CaveSeg Model Architecture
Supervised Learning Pipeline
Training Setups for CaveSeg & Baseline Models
Performance Analyses of CaveSeg
Quantitative Evaluation
Qualitative Evaluation
Use Cases: Vision-based Cave Exploration and Semantic Mapping by AUVs
Safe AUV Navigation Inside Underwater Caves
...and 4 more sections

Figures (9)

Figure 1: (a) A tethered BlueROV2 is operating inside the Orange Grove underwater cave system in FL, USA; it is teleoperated by a surface operator following the caveline as a navigation guide; (b) the corresponding POV from the robot's camera; (c) the proposed semantic parsing concept is shown; the envisioned capabilities are: first-layer & second-layer obstacle avoidance, ground plane estimation, and caveline detection, following, and 3D estimation, all intended to enable autonomous robot navigation inside underwater caves.
Figure 2: A few sample images from the proposed CaveSeg dataset, corresponding ground truth labels, and their overlaid visualizations are shown; color codes for each object category are listed on the right.
Figure 3: Frequencies and distributions of important object categories are shown in the train, validation, and test sets.
Figure 4: The network architecture of our proposed CaveSeg model is shown. Input images are partitioned into $4\times4$ patches and fed into a four-stage transformer backbone for coarse-to-fine feature extraction. The extracted multi-scale features are then pooled and combined by the PPM head for bottom-up and top-down feature aggregation. A hierarchical feature map is then compiled by merging several multi-level feature representations. On this feature space, a classifier performs pixel-level semantic segmentation to generate the final outputs.
Figure 5: A few qualitative performance comparisons of all models on the CaveSeg-Challenge test set are shown (results for only the five top-performing models are shown for clarity). Note that the object detection and localization accuracy for categories such as caveline, open area, and navigation markers is particularly important for AUV navigation.
...and 4 more figures
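
The Figure 4 caption says input images are partitioned into $4\times4$ patches before entering the transformer backbone. The following is a minimal sketch of that one step only, assuming non-overlapping patches, a channel-last image stored as nested lists, and tokens formed by flattening each patch. The names `partition_into_patches` and `make_ramp_image` are hypothetical and are not taken from the paper or its code; the backbone and PPM head are not modeled here.

```python
from typing import List

# Assumed layout: image[row][col] is a list of channel values (H x W x C).
Image = List[List[List[float]]]


def partition_into_patches(image: Image, patch: int = 4) -> List[List[float]]:
    """Split an H x W x C image into non-overlapping patch x patch tiles.

    Each tile is flattened into one token of length patch * patch * C.
    Tokens are returned in row-major order over the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])
            tokens.append(token)
    return tokens


def make_ramp_image(height: int, width: int, channels: int = 3) -> Image:
    """Hypothetical test image: every channel of pixel (r, c) holds r * width + c."""
    return [
        [[float(r * width + c)] * channels for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    tokens = partition_into_patches(make_ramp_image(8, 12))
    # An 8 x 12 RGB image yields a 2 x 3 grid of patches.
    print(len(tokens), len(tokens[0]))
```

Under these assumptions, an image of size H x W with C channels produces (H/4) * (W/4) tokens of length 16 * C, which is why the token grid is a quarter of the input resolution along each side.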
