WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents

Runwei Guan; Shaofeng Liang; Ningwei Ouyang; Weichen Fei; Shanliang Yao; Wei Dai; Chenhao Ge; Penglei Sun; Xiaohui Zhu; Tao Huang; Ryan Wen Liu; Hui Xiong

TL;DR

The paper presents WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments, and introduces NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning.

Abstract

While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a lack of knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels (ASVs) to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By combining Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind moves ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly outperforms existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.

Paper Structure (20 sections, 7 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Related Works
Waterway Passive Perception based on Deep Learning
Waterway Understanding with Natural Language
Video Understanding upon Multi-Agent System
WaterVideoQA Dataset
Annotation Process
Statistics
Method
Adaptive Semantic Routing
Domain-Specific Maritime Knowledge Retrieval
Situation-Aware Hierarchical Reasoning
Autonomous Self-Reflective Verification
Experiments
Dataset Settings
...and 5 more sections

Figures (8)

Figure 1: Overview of the pipeline, including (a) the proposed WaterVideoQA dataset; (b) the proposed multi-agent neuro-symbolic reasoning system, NaviMind; (c) real-world application scenarios for trustworthy navigation guidance.
Figure 2: The statistics of our proposed WaterVideoQA dataset, including (a) Sample Type Distribution, (b) Average Q/A Length by Category, (c) Video Duration by Question Category, (d) Question Category Distribution, (e) Answer Type Distribution and (f) Word Cloud of Q&A.
Figure 3: Overview of the annotation process for WaterVideoQA.
Figure 4: The architecture of NaviMind, a multi-agent reasoning system for waterway navigation. NaviMind has two inputs (a user query and a video clip) and an output answer. It has 5 agents for routing, captioning, reasoning, grading and summary.
Figure 5: The workflow of Situation-Aware Hierarchical Reasoning.
...and 3 more figures
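
The agent roles named in the Figure 4 caption (routing, captioning, reasoning, grading, summary) can be pictured as a simple pipeline. The Python sketch below is illustrative only and is not the authors' implementation: every function name, the keyword-based router, the stub captioner and reasoner, and the retry-on-failed-grade loop are assumptions made here to show how five such agents might be chained. The page itself does not specify how the agents exchange information or how verification failures are handled.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Record of one pipeline run (hypothetical structure)."""
    query: str
    video_id: str
    steps: List[str] = field(default_factory=list)
    answer: str = ""


def route(query: str) -> str:
    # Assumed stand-in for semantic routing: a plain keyword rule.
    rule_words = ("should", "allowed", "must", "give way")
    return "rule" if any(w in query.lower() for w in rule_words) else "perception"


def caption(video_id: str) -> str:
    # Stub captioner; a real system would call a video-language model.
    return f"caption of {video_id}"


def reason(query: str, caption_text: str, label: str, attempt: int) -> str:
    # Stub reasoner; the attempt number is embedded so a grader can react to it.
    return f"[{label}] attempt {attempt}: {query} | {caption_text}"


def summarize(draft: str) -> str:
    # Stub summary agent: returns the accepted draft as the final answer.
    return f"final: {draft}"


def run_pipeline(query: str, video_id: str,
                 grader: Callable[[str], bool], max_rounds: int = 3) -> Trace:
    """Chain the five roles; re-run reasoning while the grader rejects (assumed policy)."""
    trace = Trace(query, video_id)
    label = route(query)
    trace.steps.append(f"route:{label}")
    cap = caption(video_id)
    trace.steps.append("caption")
    draft = ""
    for attempt in range(1, max_rounds + 1):
        draft = reason(query, cap, label, attempt)
        trace.steps.append(f"reason:{attempt}")
        ok = grader(draft)
        trace.steps.append(f"grade:{'pass' if ok else 'fail'}")
        if ok:
            break
    trace.answer = summarize(draft)
    trace.steps.append("summary")
    return trace


if __name__ == "__main__":
    # Toy grader that only accepts the second draft, to exercise the retry loop.
    result = run_pipeline("Should the vessel give way to the ferry?", "clip_0001",
                          grader=lambda d: "attempt 2" in d)
    print(result.steps)
    print(result.answer)
```

Passing the grader in as a callable keeps the verification policy separate from the control flow, so a stricter or model-based checker could be swapped in without touching the loop.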
