Opti-Acoustic Semantic SLAM with Unknown Objects in Underwater Environments

Kurran Singh, Jungseok Hong, Nicholas R. Rypkema, John J. Leonard

TL;DR

This work tackles underwater semantic SLAM with unknown object classes by fusing monocular camera information and multibeam sonar within a probabilistic, open-set framework. It introduces an object-centric pipeline that (1) segments objects without supervision and extracts fixed-size semantic embeddings, (2) localizes objects via opti-acoustic fusion, and (3) jointly optimizes vehicle pose and object landmarks in a factor graph using iSAM2. Key contributions include robust data association using latent-code similarities and Mahalanobis gating, a three-way pipeline that handles multimodal sensing under challenging underwater conditions, and publicly released indoor/outdoor underwater datasets with 17 object classes. The results show improved trajectory and map accuracy over baselines and demonstrate loop closures through semantic object re-identification, highlighting practical impact for autonomous underwater operations and diver-AUV collaboration.

Abstract

Despite recent advances in semantic Simultaneous Localization and Mapping (SLAM) for terrestrial and aerial applications, underwater semantic SLAM remains an open and largely unaddressed research problem due to the unique sensing modalities and the object classes found underwater. This paper presents an object-based semantic SLAM method for underwater environments that can identify, localize, classify, and map a wide variety of marine objects without a priori knowledge of the object classes present in the scene. The method performs unsupervised object segmentation and object-level feature aggregation, and then uses opti-acoustic sensor fusion for object localization. Probabilistic data association is used to determine observation to landmark correspondences. Given such correspondences, the method then jointly optimizes landmark and vehicle position estimates. Indoor and outdoor underwater datasets with a wide variety of objects and challenging acoustic and lighting conditions are collected for evaluation and made publicly available. Quantitative and qualitative results show the proposed method achieves reduced trajectory error compared to baseline methods, and is able to obtain comparable map accuracy to a baseline closed-set method that requires hand-labeled data of all objects in the scene.
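
To make the data association step more concrete, the following is a minimal sketch of one way to combine a Mahalanobis gate on object position with an embedding similarity check. It is not the authors' implementation: the hard nearest-match assignment, the diagonal covariance, the cosine similarity measure, the `min_similarity` threshold, and all function names are assumptions made for illustration. The only fixed constant is the 95% chi-square threshold for three degrees of freedom.

```python
import math

# 95% chi-square threshold for 3 degrees of freedom (3D position residual).
CHI2_GATE_3DOF = 7.815


def mahalanobis_sq(obs_pos, lm_pos, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance (illustrative)."""
    return sum((o - l) ** 2 / v for o, l, v in zip(obs_pos, lm_pos, variances))


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def associate(obs_pos, obs_emb, landmarks, variances, min_similarity=0.8):
    """Return the index of the matching landmark, or None for a new landmark.

    landmarks is a list of (position, embedding) pairs. A candidate must pass
    the position gate first; among the survivors, the most similar embedding
    at or above min_similarity (a hypothetical threshold) wins.
    """
    best_idx, best_sim = None, min_similarity
    for idx, (lm_pos, lm_emb) in enumerate(landmarks):
        if mahalanobis_sq(obs_pos, lm_pos, variances) > CHI2_GATE_3DOF:
            continue  # geometrically implausible match
        sim = cosine_similarity(obs_emb, lm_emb)
        if sim >= best_sim:
            best_idx, best_sim = idx, sim
    return best_idx


# Usage: two known landmarks with toy 2-D embeddings.
landmarks = [((0.0, 0.0, 0.0), [1.0, 0.0]), ((5.0, 0.0, 0.0), [0.0, 1.0])]
variances = (0.25, 0.25, 1.0)  # assumed per-axis position variances
match = associate((0.2, 0.1, 0.3), [0.9, 0.1], landmarks, variances)
```

Returning `None` when no landmark passes both checks is what lets an open-set system add a new landmark for an object it has not seen before, rather than forcing a match to a known class.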

Paper Structure (14 sections, 9 equations, 7 figures, 3 tables, 1 algorithm)

Figures (7)

  • Figure 1: An example scenario, where an underwater vehicle encounters objects during its navigation, demonstrating the proposed semantic SLAM approach. (Top): The vehicle senses objects using both optical and acoustic sensors. Optical sensing information is used to extract semantic feature embeddings of each object, and the ranges to the objects are obtained from the acoustic sensor. (Bottom): Information obtained from objects and odometry is used for building a factor graph ($X_i$: vehicle states; $L_i$: landmark states; $\theta_{Li}, \phi_{Li}, z_{Li}$: bearing, elevation, and ranges to objects; $z_{v}, \phi_v, \psi_v$: depth, pitch, and roll of the vehicle; $x_v, y_v, \theta_v$: location and yaw of the vehicle).
  • Figure 2: The pipeline receives sensor inputs from a multibeam sonar, a monocular camera, an IMU, a DVL, and a pressure sensor. Given those sensor inputs, the system does the following: (1) Segment objects from optical images and extract features for each segmentation mask. The dimension of these extracted features depends on the corresponding object's size in the image. These features are projected into a fixed dimension for the data association process. (2) Given the segmentation mask, its pixel centroid is used to find a corresponding range from sonar returns and estimate a 3D position of the object. (3) In parallel, sensor readings from the IMU, DVL, and pressure sensor are used to estimate odometry. Outputs from (1)-(3) are used to build a factor graph, which is optimized via iSAM2 to produce maximum a posteriori (MAP) estimates. Lastly, we obtain map and trajectory estimates as an output.
  • Figure 3: Schematic representation of sensor placements of a camera and sonar. The diagram on the left illustrates the spatial configuration of a camera and sonar sensor with their respective coordinate frames, where $C$ represents a camera frame. The camera's field of view is depicted in blue, and the sonar's conical beam is represented in red. The diagram on the right details the observed object's orientation angles of $\theta, \phi_o$ from the camera and $\phi_s$ from the sonar, depicting how angular measurements are derived.
  • Figure 4: (Top left): Input image of a lobster cage, (Top right): Detected object, (Bottom left): Corresponding sonar frame for object range estimation, (Bottom right): Enhanced frame for visualization. Sonar data can often be noisy and is subject to multi-path effects that make the detection and range extraction of objects challenging. Our method is largely robust to such issues through the use of a probabilistic data association method that takes into account the range uncertainty that is typical with the use of sonar.
  • Figure 5: (Top): The outdoor tank used for data collection exhibits challenging lighting conditions, including reflections (left) and caustics (right). (Bottom): Selected raw (uncalibrated) frames from the vehicle's monocular camera. These images demonstrate how caustics (ripple lighting effects due to surface waves), reflections, and sudden lighting changes make processing underwater imagery particularly difficult compared to terrestrial scenes. Feature-based methods fail to maintain tracking due to the dynamic textures; our method uses objects instead and proves to be robust to such lighting effects.
  • ...and 2 more figures
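
The opti-acoustic localization step described in the Figure 2 and Figure 3 captions (pixel centroid from the camera, range from the sonar) can be sketched as below. This is an illustration under stated assumptions, not the paper's formulation: it assumes an ideal pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, a camera frame with x right, y down, and z forward, and a sonar co-located with the camera so that the camera-sonar extrinsics are ignored.

```python
import math


def object_position(u, v, sonar_range, fx, fy, cx, cy):
    """Estimate bearing, elevation, and a 3D point from a pixel and a range.

    (u, v) is the pixel centroid of the segmentation mask and sonar_range is
    the range picked from the sonar returns. The sensors are assumed to be
    co-located (an illustrative simplification).
    """
    # Back-project the pixel to a ray in the camera frame (z forward).
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    direction = (x / norm, y / norm, 1.0 / norm)

    # Bearing is positive to the right; elevation is positive upward
    # (image y points down, hence the sign flip).
    bearing = math.atan2(x, 1.0)
    elevation = math.atan2(-y, math.hypot(x, 1.0))

    # Scale the unit ray by the sonar range to obtain the 3D position.
    point = tuple(sonar_range * d for d in direction)
    return bearing, elevation, point


# Usage: an object centered in a 640x480 image at 2 m range.
bearing, elevation, point = object_position(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0)
```

In this sketch the camera supplies the two angles and the sonar supplies the range, the quantity a monocular camera cannot observe directly. A real system would also apply the extrinsic transform between the two sensors.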