DREAM: Domain-aware Reasoning for Efficient Autonomous Underwater Monitoring
Zhenqi Wu, Abhinav Modi, Angelos Mavrogiannis, Kaustubh Joshi, Nikhil Chopra, Yiannis Aloimonos, Nare Karapetyan, Ioannis Rekleitis, Xiaomin Lin
TL;DR
DREAM addresses persistent underwater monitoring by combining domain knowledge with a Vision Language Model in a three‑layer architecture: perception, cognitive‑aware planning with Chain‑of‑Thought, and control. It builds a persistent occupancy map from multimodal sensing and uses a VLM‑guided planner to generate high‑level actions that efficiently cover target objects like oysters and shipwrecks while enforcing safety. Across simulations and a real‑world tank test, DREAM outperforms a state‑of‑the‑art baseline and a vanilla VLM approach in time efficiency and coverage, demonstrating practical feasibility for long‑term marine monitoring. This work advances autonomous underwater surveillance by enabling environment‑aware decision making, persistent memory, and robust low‑level control, with open‑source releases planned for datasets, environments, and code.
Abstract
The ocean is warming and acidifying, increasing the risk of mass mortality events for temperature-sensitive shellfish such as oysters. This motivates long-term monitoring systems and highlights the need for persistent, wide-area, and low-cost benthic monitoring. However, human labor is costly and long-duration underwater work is highly hazardous, which favors robotic solutions as a safer and more efficient option. To make real-time, environment-aware decisions without human intervention, underwater robots must be equipped with an intelligent "brain." To this end, we present DREAM, a Vision Language Model (VLM)-guided autonomy framework for long-term underwater exploration and habitat monitoring. The results show that our framework is highly efficient in finding and exploring target objects (e.g., oysters, shipwrecks) without prior location information. In the oyster-monitoring task, our framework takes 31.5% less time than the previous baseline for the same number of oysters. Compared to the vanilla VLM, it uses 23% fewer steps while covering 8.88% more oysters. In shipwreck scenes, our framework successfully explores and maps the wreck without collisions, requiring 27.5% fewer steps than the vanilla VLM and achieving 100% coverage, whereas the vanilla VLM achieves 60.23% average coverage in our shipwreck environments.
