MarineGPT: Unlocking Secrets of Ocean to the Public

Ziqiang Zheng, Jipeng Zhang, Tuan-Anh Vu, Shizhe Diao, Yue Him Wong Tim, Sai-Kit Yeung

TL;DR

MarineGPT addresses the lack of domain-specific ocean knowledge in general-purpose models by introducing a marine-domain vision-language model trained in two stages: marine-specific continuous pre-training on the Marine-5M image-text corpus, followed by instruction-following fine-tuning on 1.12M high-quality multi-modal samples. The approach injects marine knowledge through 129 attributes, context-rich captions, and 50 marine-centric instructions so that the model gives sensitive, informative, and scientific responses. Empirical results show that MarineGPT produces detailed, scientifically grounded descriptions and accurate species information, with better fine-grained recognition than general-purpose MLLMs. The results also highlight remaining challenges, such as hallucinations and the need for larger LLMs. The work further provides a scalable data-generation pipeline, and the authors plan to release models and data to support marine biology, biodiversity monitoring, and public engagement in marine science.
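The attribute-and-instruction recipe summarized above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `InstructionSample` type, the two instruction templates, the attribute keys, and the image path are hypothetical and are not taken from the paper, whose actual pipeline uses 129 attributes and 50 instructions.

```python
import random
from dataclasses import dataclass


@dataclass
class InstructionSample:
    """One instruction-following training example (hypothetical schema)."""
    image_path: str
    instruction: str
    response: str


# Hypothetical stand-ins for the paper's 50 marine-centric instructions.
INSTRUCTIONS = [
    "Describe the marine organism in this image in detail.",
    "What species is shown, and where does it usually live?",
]


def build_response(common_name: str, scientific_name: str,
                   attributes: dict[str, str]) -> str:
    """Compose a caption from named attributes.

    The attribute keys are illustrative, not the paper's 129 attributes.
    Keys are sorted so the output is deterministic.
    """
    facts = "; ".join(f"{key}: {value}" for key, value in sorted(attributes.items()))
    return f"This is a {common_name} ({scientific_name}). {facts}."


def make_sample(image_path: str, common_name: str, scientific_name: str,
                attributes: dict[str, str], rng: random.Random) -> InstructionSample:
    """Pair a randomly chosen instruction with an attribute-based response."""
    return InstructionSample(
        image_path=image_path,
        instruction=rng.choice(INSTRUCTIONS),
        response=build_response(common_name, scientific_name, attributes),
    )


sample = make_sample(
    "images/clownfish_001.jpg",  # hypothetical path
    "clownfish",
    "Amphiprion ocellaris",
    {"habitat": "sea anemones on coral reefs", "diet": "omnivore"},
    random.Random(0),
)
print(sample.instruction)
print(sample.response)
```

Sampling the instruction separately from the attribute-based response is one simple way to get varied prompts over the same underlying facts; whether the authors combine them this way is not stated in the summary.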

Abstract

Large language models (LLMs), such as ChatGPT/GPT-4, have proven to be powerful tools for improving the user experience as AI assistants. Follow-up work has proposed multi-modal large language models (MLLMs), which give LLMs the ability to perceive inputs from multiple modalities by constructing a joint semantic space (e.g., a visual-text space). Despite the significant success of LLMs and MLLMs, their use in domain-specific applications that require domain-specific knowledge and expertise has been less explored, especially in the marine domain. Unlike general-purpose MLLMs, a marine-specific MLLM is required to yield much more sensitive, informative, and scientific responses. In this work, we demonstrate that existing MLLMs optimized on huge amounts of readily available general-purpose training data show minimal ability to understand domain-specific intents and to generate informative and satisfactory responses. To address these issues, we propose MarineGPT, the first vision-language model specially designed for the marine domain, unlocking the secrets of the ocean to the public. We present our Marine-5M dataset, with more than 5 million marine image-text pairs, to inject domain-specific marine knowledge into our model and achieve better marine vision and language alignment. MarineGPT not only extends marine understanding to the general public but also offers a standard protocol for adapting a general-purpose assistant into a downstream domain-specific expert. We pave the way for a wide range of marine applications while providing valuable data and pre-trained models for future research in both academic and industrial communities.

Paper Structure (17 sections, 9 figures)

Figures (9)

  • Figure 1: MarineGPT can automatically recognize various marine objects and yield diverse, domain-specific, informative, and scientific responses about each recognized object. Best viewed in color.
  • Figure 2: The framework overview of the proposed MarineGPT. There are two main procedures in our MarineGPT: 1) marine-specific continuous pre-training on 5 million marine image-text pairs; 2) instruction-following fine-tuning based on constructed high-quality instruction-following image-text pairs to generate sensitive, informative and scientific responses.
  • Figure 3: The procedure of our attribute-based image description generation pipeline.
  • Figure 4: Comparison between MiniGPT-4, GPT-4V, and our MarineGPT. MarineGPT can recognize both the common and scientific names of marine objects and provide diverse information about the recognized objects. Best viewed in color.
  • Figure 5: MarineGPT can recognize a wide range of marine objects and deliver comprehensive marine and biological knowledge, so that users can gain a full understanding of the recognized objects.
  • ...and 4 more figures