CORAL: COntextual Reasoning And Local Planning in A Hierarchical VLM Framework for Underwater Monitoring

Zhenqi Wu, Yuanjie Lu, Xuesu Xiao, Xiaomin Lin

Abstract

Oyster reefs are critical ecosystems that sustain biodiversity, filter water, and protect coastlines, yet they continue to decline globally. Restoring these ecosystems requires regular underwater monitoring to assess reef health, a task that remains costly, hazardous, and limited when performed by human divers. Autonomous underwater vehicles (AUVs) offer a promising alternative, but existing AUVs rely on geometry-based navigation that cannot interpret scene semantics. Recent vision-language models (VLMs) enable semantic reasoning for intelligent exploration, but existing VLM-driven systems adopt an end-to-end paradigm that introduces three key limitations. First, these systems require the VLM to generate every navigation decision, forcing frequent waits for inference. Second, VLMs cannot model robot dynamics, causing collisions in cluttered environments. Third, limited self-correction allows small deviations to accumulate into large path errors. To address these limitations, we propose CORAL, a framework that decouples high-level semantic reasoning from low-level reactive control. The VLM provides high-level exploration guidance by selecting waypoints, while a dynamics-based planner handles low-level collision-free execution. A geometric verification module validates waypoints and triggers replanning when needed. Compared with the previous state of the art, CORAL improves coverage by 14.28 percentage points (17.85% in relative terms), eliminates collisions (a 100% reduction), and requires 57% fewer VLM calls.
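
The two coverage figures can be cross-checked with a line of arithmetic. As a sketch, let $c_0$ denote the baseline's coverage and $c_1$ CORAL's; both symbols are introduced here for illustration, and the abstract does not report either value. Assuming the absolute and relative gains refer to the same pair of measurements:

```latex
c_1 - c_0 = 14.28\ \text{pp}, \qquad \frac{c_1 - c_0}{c_0} = 17.85\%
\;\Longrightarrow\;
c_0 = \frac{14.28}{0.1785} = 80.00\%, \qquad c_1 = c_0 + 14.28 = 94.28\%.
```

These implied coverage levels are inferred from the two stated gains, not quoted from the paper.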

Paper Structure (35 sections, 12 equations, 4 figures, 3 tables, 3 algorithms)

Figures (4)

  • Figure 1: Example of CORAL deployed in the real world on a BlueROV surveying an oyster reef in a pool.
  • Figure 2: Overview of the CORAL framework. The perception module fuses segmentation and depth images into an occupancy map with labeled cluster centroids. The high-level planner invokes the VLM only when the smart trigger fires; a self-verification module rejects invalid waypoints with corrective feedback. Verified waypoints are executed by the MDDP low-level controller.
  • Figure 3: Trajectory comparison across five topologically distinct, representative environments. Blue: DREAM; orange: CORAL w/o Low-Level Planner; yellow: CORAL (Full). DREAM's trajectories frequently deviate from oyster clusters and exhibit erratic motion due to per-step VLM control. CORAL w/o Low-Level Planner follows the centroid chain more closely but produces jagged paths because of the limitations of PD control. CORAL (Full) achieves the smoothest, most coverage-efficient trajectories by combining VLM-guided waypoint selection with DDP-augmented MPPI execution.
  • Figure 4: Example of a real-world environment and VLM output.
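
The control flow described in the Figure 2 caption (query the VLM only on a trigger, verify the proposed waypoint geometrically, and return corrective feedback on rejection) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the trigger rule, the three geometric checks, the retry limit, and every name (`should_query_vlm`, `verify_waypoint`, `select_waypoint`, `query_vlm`, `max_range`, `reach_tol`) are assumptions made for the example, and the VLM is stood in for by any callable.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def should_query_vlm(active_wp: Optional[Point], robot_xy: Point,
                     reach_tol: float = 0.3) -> bool:
    """Hypothetical trigger: fire when there is no goal or the goal is reached."""
    return active_wp is None or math.dist(active_wp, robot_xy) <= reach_tol


def verify_waypoint(wp: Point, occupancy: Sequence[Sequence[int]],
                    cell_size: float, robot_xy: Point,
                    max_range: float) -> Verdict:
    """Assumed geometric checks: inside the map, in a free cell, within range."""
    rows, cols = len(occupancy), len(occupancy[0])
    col, row = int(wp[0] // cell_size), int(wp[1] // cell_size)
    if not (0 <= row < rows and 0 <= col < cols):
        return Verdict(False, "waypoint is outside the mapped area")
    if occupancy[row][col]:
        return Verdict(False, "waypoint lies in an occupied cell")
    if math.dist(wp, robot_xy) > max_range:
        return Verdict(False, "waypoint is farther than max_range")
    return Verdict(True)


def select_waypoint(query_vlm: Callable[[List[Point], str], Point],
                    centroids: List[Point],
                    occupancy: Sequence[Sequence[int]],
                    cell_size: float, robot_xy: Point,
                    max_range: float = 5.0,
                    max_retries: int = 3) -> Optional[Point]:
    """Ask the VLM for a waypoint, re-asking with feedback after each rejection."""
    feedback = ""
    for _ in range(max_retries):
        wp = query_vlm(centroids, feedback)
        verdict = verify_waypoint(wp, occupancy, cell_size, robot_xy, max_range)
        if verdict.ok:
            return wp  # hand off to the low-level planner
        feedback = verdict.feedback  # corrective feedback for the next query
    return None  # caller decides how to recover, e.g. hold position
```

In this sketch the low-level planner runs at every control step, while `select_waypoint` is called only when `should_query_vlm` returns true, which is the kind of arrangement that lets a hierarchical design reduce the number of VLM calls.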