NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Wei Xu, Cheng Wang, Dingkang Liang, Zongchuang Zhao, Xingyu Jiang, Peng Zhang, Xiang Bai

TL;DR

Nautilus addresses underwater scene understanding by integrating a physics-informed Vision Feature Enhancement (VFE) module into established large multimodal models, supporting eight tasks across image-, region-, and object-level perception. The authors build NautData, a large underwater instruction-following dataset (≈158K images, ≈1.45M QA pairs) for multi-task tuning and evaluation, and show that VFE improves robustness to underwater degradation via depth-aware restoration and backscatter suppression. By feeding both original and enhanced vision features to the LLM, Nautilus outperforms its baselines on most tasks (coarse/fine classification, counting, grounding, detection, VQA, and captioning) on NautData and MarineInst, including under degraded conditions. The work contributes a multi-granular vision-language framework and a substantial benchmark for future domain-specific LMMs.

Abstract

Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study underwater scene understanding methods, which aim to automate underwater exploration. Underwater scene understanding demands multi-task perception at multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders progress in this research. To bridge this gap, we construct NautData, a dataset containing 1.45M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into two well-known baselines, LLaVA-1.5 and Qwen2.5-VL, and build our underwater LMM, NAUTILUS. Experiments conducted on NautData and public underwater datasets demonstrate the effectiveness of the VFE module, which consistently improves the performance of both baselines on the majority of supported tasks and supports the advantage of NAUTILUS in underwater scene understanding. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.

Paper Structure

This paper contains 19 sections, 4 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: The underwater environment presents a visually rich and dynamically evolving landscape. Nautilus addresses eight diverse underwater tasks (coarse-grained classification, fine-grained classification, counting, visual question answering (VQA), detection, grounding, region captioning, and image captioning), enabling comprehensive understanding across multiple granularities.
  • Figure 2: Illustration of the data construction framework. Eight tasks are involved, and the data generation process is tailored to each task. Rule-based generation utilizes predefined templates to generate question-answer pairs. Integration generation integrates question-answer pairs using both templates and outputs from LMMs. Free-form generation enables LMMs to construct questions and answers based on the content they focus on.
  • Figure 3: The framework of Nautilus. Inspired by underwater physical priors [zhou2023underwater, akkaynak2018revised, nathan2024osmosis], we sample dark pixels to quantify the responses of underwater degradation. The vision feature enhancement (VFE) module improves underwater LMMs with depth information as auxiliary information. Outputs of the image encoder and the VFE module are fed into an LLM to facilitate multimodal processing.
  • Figure 4: The structure of the vision feature enhancement (VFE) module. The inputs consist of the vision feature, the index of the dark pixel, and the depth feature. It outputs enhanced vision features capturing restored underwater information.
  • Figure 5: Qualitative results on underwater scene understanding. Nautilus perceives image-, region-, and object-level information while addressing eight tasks. Our underwater LMM exhibits remarkable multimodal instruction-following performance, serving as a meaningful contribution to this field.
  • ...and 8 more figures
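
The captions for Figures 3 and 4 mention sampling dark pixels and using depth as auxiliary input, but this summary does not say how the sampling is done. The sketch below shows one common way to pick dark pixels in physics-based underwater restoration: split the image into depth bins and keep the darkest fraction of pixels in each bin, since very dark pixels are dominated by backscatter rather than scene radiance. The function name `sample_dark_pixels`, the parameters `num_bins` and `fraction`, and the binning scheme are assumptions made for illustration; they are not taken from the paper and may differ from what NAUTILUS actually does.

```python
import numpy as np


def sample_dark_pixels(image, depth, num_bins=10, fraction=0.01):
    """Return sorted flat indices of the darkest pixels in each depth bin.

    Hypothetical helper, not the authors' implementation.
    image: (H, W, 3) float array, e.g. values in [0, 1]
    depth: (H, W) float array, larger values mean farther away
    """
    intensity = image.mean(axis=2).ravel()  # per-pixel brightness
    depth_flat = depth.ravel()

    # Equal-width depth bins; interior edges map each pixel to 0..num_bins-1.
    edges = np.linspace(depth_flat.min(), depth_flat.max(), num_bins + 1)
    bin_ids = np.clip(np.digitize(depth_flat, edges[1:-1]), 0, num_bins - 1)

    selected = []
    for b in range(num_bins):
        idx = np.flatnonzero(bin_ids == b)
        if idx.size == 0:
            continue  # no pixels fall in this depth range
        k = max(1, int(np.ceil(fraction * idx.size)))
        # Keep the k lowest-intensity pixels of this bin.
        selected.append(idx[np.argsort(intensity[idx])[:k]])
    return np.sort(np.concatenate(selected))


# Usage: a 4x4 image with two depth layers and one dark pixel in each layer.
image = np.full((4, 4, 3), 0.5)
image[0, 0] = 0.0   # darkest pixel in the near layer
image[0, 3] = 0.1   # darkest pixel in the far layer
depth = np.zeros((4, 4))
depth[:, 2:] = 1.0
dark_idx = sample_dark_pixels(image, depth, num_bins=2, fraction=0.01)
```

In a pipeline like the one in Figure 4, such indices could be passed alongside the vision and depth features to an enhancement module; how NAUTILUS consumes them is described in the paper itself, not in this sketch.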