
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar

Runwei Guan, Liye Jia, Fengyufan Yang, Shanliang Yao, Erick Purwanto, Xiaohui Zhu, Eng Gee Lim, Jeremy Smith, Ka Lok Man, Xuming Hu, Yutao Yue

TL;DR

The paper tackles waterway perception for unmanned surface vehicles by introducing WaterVG, a multimodal visual grounding dataset that pairs RGB imagery with 4D mmWave radar and language prompts describing multiple targets. It presents Potamoi, a low-power, one-stage multi-task model that uses Phased Heterogeneous Modality Fusion (PHMF) with Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA) to fuse visual, radar, and textual cues for both Referring Expression Comprehension and Segmentation. WaterVG enables fine-grained, prompt-driven grounding in challenging water environments, and Potamoi achieves competitive accuracy while drastically reducing power consumption, highlighting its suitability for energy-constrained USV operation. The work advances practical multimodal perception for waterway tasks and provides a dataset and model that can influence real-time navigation, rescue, and environmental monitoring applications in aquatic settings.

Abstract

Perceiving waterways according to human intent is important for the autonomous navigation and operation of Unmanned Surface Vehicles (USVs) in water environments. Inspired by visual grounding, we introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception driven by human prompts. WaterVG contains prompts that describe multiple targets, with instance-level annotations including bounding boxes and masks. Notably, WaterVG includes 11,568 samples with 34,987 referred targets, and its prompts integrate both visual and radar characteristics. Guiding two sensors with text in this way gives the prompts a finer granularity, since they can refer to both the visual and the radar features of the referred targets. Moreover, we propose a low-power visual grounding model, Potamoi, a multi-task model with a Phased Heterogeneous Modality Fusion (PHMF) mode that includes Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). Specifically, ARW extracts the radar features required for fusion with vision for prompt alignment. MHSCA is an efficient fusion module with a remarkably small parameter count and few FLOPs; it fuses the scenario context captured by the two sensors with linguistic features and performs impressively on visual grounding tasks. Comprehensive experiments and evaluations have been conducted on WaterVG, where Potamoi achieves state-of-the-art performance compared with its counterparts.
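The abstract names the two fusion stages but does not specify their internals, so the following is only a minimal NumPy sketch of the general idea: gate the radar features, add them to the visual features, then let the fused scene tokens attend to the text tokens. It is not the authors' ARW or MHSCA. The function names (`weight_radar`, `cross_attention`, `fuse`), the sigmoid gate, the single attention head, the tensor shapes, and the random weights are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def weight_radar(radar, w_gate):
    """Hypothetical gate: one scalar weight in (0, 1) per radar token.

    radar: (N, C) radar features, w_gate: (C,) projection vector.
    """
    gate = sigmoid(radar @ w_gate)[:, None]  # (N, 1)
    return gate * radar, gate


def cross_attention(queries, context, w_q, w_k, w_v):
    """Plain single-head cross attention (an assumed stand-in, not MHSCA)."""
    q = queries @ w_q                          # (N, D)
    k = context @ w_k                          # (T, D)
    v = context @ w_v                          # (T, C)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, T)
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return attn @ v, attn


def fuse(image, radar, text, params):
    # Stage 1: weighted radar features are added to the visual features.
    gated_radar, gate = weight_radar(radar, params["w_gate"])
    scene = image + gated_radar
    # Stage 2: scene tokens query the text tokens; residual connection.
    attended, attn = cross_attention(
        scene, text, params["w_q"], params["w_k"], params["w_v"]
    )
    return scene + attended, gate, attn


# Toy sizes (assumed): N scene tokens, T text tokens, C channels, D head width.
N, T, C, D = 6, 4, 8, 4
rng = np.random.default_rng(0)
params = {
    "w_gate": rng.standard_normal(C),
    "w_q": rng.standard_normal((C, D)),
    "w_k": rng.standard_normal((C, D)),
    "w_v": rng.standard_normal((C, C)),
}
image = rng.standard_normal((N, C))
radar = rng.standard_normal((N, C))
text = rng.standard_normal((T, C))

fused, gate, attn = fuse(image, radar, text, params)
print(fused.shape, gate.shape, attn.shape)
```

In this sketch the fused tensor keeps the shape of the visual features, so a downstream box or mask head could consume it unchanged. A real low-power design would presumably also reduce the projection widths or the number of tokens, which the sketch does not attempt.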

Paper Structure

This paper contains 18 sections, 8 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Overview of our pipeline for WaterVG. The camera and radar align prompt on appearance, motion and distance features.
  • Figure 2: Samples in WaterVG. The annotations include bounding boxes and masks, supplemented by radar point clouds (red).
  • Figure 3: Statistics of WaterVG: the proportion of waterways (a), sentence patterns (b), and query types (c), together with the referred target number (d), category distribution (e), prompt length (f), and word cloud (g).
  • Figure 4: Five-step prompt annotations of WaterVG.
  • Figure 5: The architecture of Potamoi. The text encoder for text prompts, ALBERT (Lan et al., 2020), is frozen during training.
  • ...and 5 more figures