USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots
Junwen Gu, Zhiheng Wu, Pengxuan Si, Shuang Qiu, Yukai Feng, Luoyang Sun, Laien Luo, Lianyi Yu, Jian Wang, Zhengxing Wu
TL;DR
This work addresses two gaps: the scarcity of large-scale, multi-task underwater datasets and the lack of general-purpose autonomous underwater robots. It introduces USIM, a simulation-based Vision-Language-Action (VLA) dataset with over 561K frames, 1,852 trajectories, and 20 tasks across 9 scenarios, and U0, a VLA model that fuses binocular vision with other sensor data and adds a convolution-attention-based perception focus enhancement module (CAP) to improve spatial understanding and mobile manipulation. U0 achieves an 80% average success rate across non-grasping tasks and, with binocular input, reduces the distance to the target in mobile grasping by 21.2% relative to a fine-tuned baseline. These results indicate that simulation-generated data can train effective underwater VLA models, supporting scalable dataset construction, improved task autonomy, and progress toward general intelligent underwater robots.
Abstract
Underwater environments present unique challenges for robotic operation, including complex hydrodynamics, limited visibility, and constrained communication. Although data-driven approaches have advanced embodied intelligence in terrestrial robots and enabled task-specific autonomous underwater robots, developing underwater intelligence capable of autonomously performing multiple tasks remains highly challenging, as large-scale, high-quality underwater datasets are still scarce. To address these limitations, we introduce USIM, a simulation-based multi-task Vision-Language-Action (VLA) dataset for underwater robots. USIM comprises over 561K frames from 1,852 trajectories, totaling approximately 15.6 hours of BlueROV2 interactions across 20 tasks in 9 diverse scenarios, ranging from visual navigation to mobile manipulation. Building upon this dataset, we propose U0, a VLA model for general underwater robots, which integrates binocular vision and other sensor modalities through multimodal fusion, and further incorporates a convolution-attention-based perception focus enhancement module (CAP) to improve spatial understanding and mobile manipulation. Across tasks such as inspection, obstacle avoidance, scanning, and dynamic tracking, the framework achieves a success rate of 80%, while in challenging mobile manipulation tasks, it reduces the distance to the target by 21.2% compared with baseline methods, demonstrating its effectiveness. USIM and U0 show that VLA models can be effectively applied to underwater robotic applications, providing a foundation for scalable dataset construction, improved task autonomy, and the practical realization of intelligent general underwater robots.
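To give a feel for the scale of USIM, the headline figures in the abstract can be turned into a few per-trajectory averages. The sketch below is only back-of-the-envelope arithmetic: the inputs are the rounded values quoted above ("over 561K" frames, "approximately 15.6" hours), the function name `dataset_summary` is hypothetical, and the frame rate it reports is implied by those totals rather than stated in the text. The averages also assume nothing about how trajectories are distributed across the 20 tasks.

```python
# Rounded figures quoted in the abstract (assumed inputs, not exact counts).
FRAMES = 561_000        # "over 561K frames"
TRAJECTORIES = 1_852    # "1,852 trajectories"
HOURS = 15.6            # "approximately 15.6 hours"
TASKS = 20              # "20 tasks"


def dataset_summary(frames, trajectories, hours, tasks):
    """Derive simple averages from the dataset totals (hypothetical helper)."""
    seconds = hours * 3600.0
    return {
        # Implied recording rate; not a value reported in the abstract.
        "frames_per_second": frames / seconds,
        "frames_per_trajectory": frames / trajectories,
        "seconds_per_trajectory": seconds / trajectories,
        # Mean only; the per-task split may be uneven.
        "trajectories_per_task": trajectories / tasks,
    }


summary = dataset_summary(FRAMES, TRAJECTORIES, HOURS, TASKS)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With these rounded inputs, the totals are consistent with a recording rate close to 10 frames per second and a mean trajectory length of roughly 30 seconds, which is a useful sanity check when planning storage or batch sizes for this kind of data.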
