A Multi-Modal Approach Based on Large Vision Model for Close-Range Underwater Target Localization

Mingyang Yang; Zeyu Sha; Feitian Zhang

TL;DR

This paper proposes a target localization method that fuses optical and acoustic sensor measurements to estimate the 3D positions of close-range underwater targets. It pairs a large vision model with acoustic-based prompts to process the multi-modal measurements, aiming for generalizable and robust underwater target localization.

Abstract

Underwater target localization uses real-time sensory measurements to estimate the position of underwater objects of interest, providing critical feedback information for underwater robots. While acoustic sensing is the most widely adopted method in underwater robotics and possibly the only effective approach for long-range underwater target localization, this sensing modality generally suffers from low resolution, high cost and high energy consumption, leading to mediocre performance in close-range underwater target localization. Optical sensing, on the other hand, has attracted increasing attention in the underwater robotics community for its high resolution and low cost, and holds great potential particularly in close-range underwater target localization. However, most existing studies in underwater optical sensing are restricted to specific types of targets because of the limited training data available. In addition, these studies typically focus on the design of estimation algorithms and ignore the influence of illumination conditions on sensing performance, which hinders wider application in the real world. To address these issues, this paper proposes a novel target localization method that assimilates both optical and acoustic sensory measurements to estimate the 3D positions of close-range underwater targets. A test platform with controllable illumination conditions is designed and developed to experimentally investigate the proposed multi-modal sensing approach. A large vision model is applied to process the optical imaging measurements, eliminating the need for training data acquisition and thus significantly expanding the scope of potential applications. Extensive experiments validate the effectiveness of the proposed underwater target localization method.
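
To make the idea of combining optical and acoustic measurements concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a rectified binocular camera under the pinhole model, where depth follows Z = f * B / d, and it fuses the optical depth with an ultrasonic range by inverse-variance weighting. The function names, the variance parameters and the choice of inverse-variance weights are all illustrative assumptions; the weighting used in the paper's averaging filter is not specified here.

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth (m) of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def fuse_ranges(z_optical, var_optical, z_acoustic, var_acoustic):
    """Inverse-variance weighted average of two range estimates.

    Assumption: both variances are known and positive. This is a generic
    fusion rule, not necessarily the weighting used in the paper.
    """
    w_optical = 1.0 / var_optical
    w_acoustic = 1.0 / var_acoustic
    return (w_optical * z_optical + w_acoustic * z_acoustic) / (w_optical + w_acoustic)


def back_project(u, v, depth_m, focal_px, cx, cy):
    """Map pixel (u, v) at a given depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    z_cam = stereo_depth(focal_px=800.0, baseline_m=0.06, disparity_px=96.0)
    z_fused = fuse_ranges(z_cam, 0.01, 0.6, 0.01)
    print(back_project(400.0, 300.0, z_fused, 800.0, 320.0, 240.0))
```

With equal variances the fused range is the midpoint of the two estimates; a noisier sensor (larger variance) contributes proportionally less.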

Paper Structure (19 sections, 22 equations, 12 figures, 4 tables, 1 algorithm)

Introduction
Test Platform
Sensor Model Preliminaries
Pinhole Camera Model
Distortion Model
Binocular Camera Model
Ultrasonic Ranging Sensor Model
Underwater Target Localization Design
Target Detection With Large Vision Model and Ranging Sensor Prompts
Multi-modal Target Localization
Target ranging with the weighted averaging filter
Target motion state estimation using EKF
Experiments
Implementation Setup
Experimental Results
...and 4 more sections

Figures (12)

Figure 1: The schematic of the proposed multi-modal close-range target localization framework for underwater robots. When an underwater target appears within the sensor measurement range, multiple optical and acoustic sensors equipped onboard the underwater robot collaboratively estimate the motion states of the target of interest.
Figure 2: Illustration of the test platform and the relevant components.
Figure 3: Layout of the 11 test scenes. Scene 1 includes 11 targets placed at a distance of 0.5 m from the sensing module. Scene 2 includes 10 targets placed at a distance of 0.55 m. Scene 3 includes 9 targets at a distance of 0.6 m. Scenes 4 and 5 include 2 and 3 targets in dynamic motion, respectively. Scenes 6-8 use the same set of aquatic life model targets shown in Fig. \ref{marinelife}, placed at distances of 0.5 m, 0.55 m and 0.6 m, respectively. Scenes 9-11 are used for the one-shot prompt locating process. Scene 1 shows the paired left and right view images of the binocular camera, while Scenes 2-11 show only the left view images. All targets in Scenes 4-5 are included in Scenes 1-3.
Figure 4: Left view images acquired in Scene 3 under 4 lux, 6 lux, 8 lux, 10 lux, 12 lux and 25 lux illumination conditions. The image under 2 lux illumination is omitted because it is barely distinguishable from its 4 lux counterpart. The gradual increase in image brightness from Fig. \ref{4-lux} to Fig. \ref{25-lux} is clearly visible.
Figure 5: Segmentation results obtained with the large vision model SAM, using the ranging sensor measurements as prompt inputs. The segmentation masks are superimposed on the original images. The first row shows three segmented cubes, the second row three segmented spheres, and the third row three segmented aquatic life models.
...and 7 more figures
