SeafloorAI: A Large-scale Vision-Language Dataset for Seafloor Geological Survey

Kien X. Nguyen, Fengchun Qiao, Arthur Trembanis, Xi Peng

Abstract

A major obstacle to the advancement of machine learning models in marine science, particularly in sonar imagery analysis, is the scarcity of AI-ready datasets. While there have been efforts to make AI-ready sonar image datasets publicly available, they are limited in environment setting and scale. To bridge this gap, we introduce SeafloorAI, the first extensive AI-ready dataset for seafloor mapping across 5 geological layers, curated in collaboration with marine scientists. We further extend the dataset to SeafloorGenAI by incorporating a language component to facilitate the development of both vision- and language-capable machine learning models for sonar imagery. The dataset consists of 62 geo-distributed data surveys spanning 17,300 square kilometers, with 696K sonar images, 827K annotated segmentation masks, 696K detailed language descriptions, and approximately 7M question-answer pairs. By making our data processing source code publicly available, we aim to engage the marine science community to enrich the data pool and inspire the machine learning community to develop more robust models. This collaborative approach will enhance the capabilities and applications of our datasets within both fields.

Paper Structure

This paper contains 9 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of the spatially distributed seafloor mapping datasets. The table highlights key dataset statistics. We incorporate 62 public data surveys published by USGS and NOAA from 9 major regions to construct the SeafloorAI and SeafloorGenAI datasets. Our dataset contains 9 geological layers: 4 are raw signals, i.e., Backscatter, Bathymetry, Slope and Rugosity, and 5 are annotated by human experts, i.e., Sediment, Physiographic Zone, Habitat, Fault and Fold. SeafloorAI serves as a dataset for standard computer vision tasks, i.e., semantic segmentation, whereas SeafloorGenAI constitutes a dataset for generative vision-language tasks, i.e., general visual question answering and instruction-following mapping. <SEG> denotes the segmentation mask output by the model.
  • Figure 2: The Barnhardt classification scheme [Barnhardt98] is based on four end-member units: (R)ock, (G)ravel, (S)and, and (M)ud. The other twelve composite categories represent the combinations of the four units, where the dominant texture ($> 50\%$) is in upper case and the subordinate ($< 50\%$) in lower case.
  • Figure 3: Twenty-one physiographic zone categories from CMECS.
  • Figure 4: Nine major categories for abiotic habitat defined in SeafloorAI.
  • Figure 5: Pipeline for generating question-answer pairs for sonar imagery samples using GPT-4. Marine scientists first identify the necessary information, followed by the extraction of geophysical parameters, geological composition, and spatial distribution. They then provide descriptions for a handful of samples from the SeafloorAI dataset. These descriptions are used to design a prompt for GPT-4 to generate high-quality, domain-specific question-answer pairs via in-context learning [brown2020language].
  • ...and 2 more figures
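
The sediment labelling convention described in the Figure 2 caption (four end-member units plus twelve composites, dominant texture in upper case and subordinate in lower case) can be illustrated with a short sketch. This is not the authors' code: the function names `barnhardt_codes` and `describe` are hypothetical, and the two-letter code format is an assumption inferred from the caption's wording.

```python
from itertools import permutations

# The four end-member units named in the Figure 2 caption.
END_MEMBERS = {"R": "Rock", "G": "Gravel", "S": "Sand", "M": "Mud"}


def barnhardt_codes():
    """Return 16 class codes: 4 end-members plus 12 composites (assumed format)."""
    pure = list(END_MEMBERS)
    # Ordered pairs of distinct units: dominant letter stays upper case,
    # subordinate letter is written in lower case.
    composite = [dom + sub.lower() for dom, sub in permutations(END_MEMBERS, 2)]
    return pure + composite


def describe(code):
    """Expand a code such as 'Sg' into a readable description."""
    dominant = END_MEMBERS[code[0]]
    if len(code) == 1:
        return dominant
    subordinate = END_MEMBERS[code[1].upper()]
    return f"{dominant} (>50%) with {subordinate.lower()} (<50%)"


if __name__ == "__main__":
    for code in barnhardt_codes():
        print(code, "-", describe(code))
```

Counting ordered pairs of distinct units gives 4 × 3 = 12 composites, which matches the twelve composite categories in the caption; together with the four end-members this yields sixteen classes.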