A General-Purpose Diversified 2D Seismic Image Dataset from NAMSS

Lucas de Magalhães Araujo, Otávio Oliveira Napoli, Sandra Avila, Edson Borin

TL;DR

The paper addresses the need for diverse, large-scale 2D seismic image datasets to enable machine learning in geophysics. It presents Unicamp-NAMSS, a balanced collection of 2D migrated seismic sections extracted from NAMSS, organized into region-disjoint macro-regions for training, validation, and testing, and saved as TIFF images after careful preprocessing. Through embedding-based analyses with ResNet-50 and DINOv2 ViT-B/14, plus comparisons to Parihaka and F3, the work demonstrates substantial intra- and inter-regional variability and broad coverage of the seismic appearance space, making the dataset well suited for self-supervised pretraining, transfer learning, and domain adaptation studies. The dataset, along with open-source preprocessing and analysis code, provides a valuable resource for benchmarking and developing models for tasks such as super-resolution, denoising, and attribute prediction, with explicit safeguards against data leakage via region-disjoint splits.

Abstract

We introduce the Unicamp-NAMSS dataset, a large, diverse, and geographically distributed collection of migrated 2D seismic sections designed to support modern machine learning research in geophysics. We constructed the dataset from the National Archive of Marine Seismic Surveys (NAMSS), which contains decades of publicly available marine seismic data acquired across multiple regions, acquisition conditions, and geological settings. After a comprehensive collection and filtering process, we obtained 2588 cleaned and standardized seismic sections from 122 survey areas, covering a wide range of vertical and horizontal sampling characteristics. To ensure reliable experimentation, we balanced the dataset so that no survey dominates the distribution, and partitioned it into non-overlapping macro-regions for training, validation, and testing. This region-disjoint split allows robust evaluation of generalization to unseen geological and acquisition conditions. We validated the dataset through quantitative and embedding-space analyses using both convolutional and transformer-based models. These analyses showed that Unicamp-NAMSS exhibits substantial variability within and across regions, while maintaining coherent structure across acquisition macro-regions and survey types. Comparisons with widely used interpretation datasets (Parihaka and F3 Block) further demonstrated that Unicamp-NAMSS covers a broader portion of the seismic appearance space, making it a strong candidate for machine learning model pretraining. The dataset therefore provides a valuable resource for machine learning tasks, including self-supervised representation learning, transfer learning, benchmarking supervised tasks such as super-resolution or attribute prediction, and studying domain adaptation in seismic interpretation.
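
The region-disjoint split described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the survey names, macro-region names, and the `split_by_macro_region` helper are hypothetical and chosen only for illustration. The idea is that every survey inherits the split of its macro-region, so no macro-region can contribute data to more than one subset.

```python
from collections import defaultdict


def split_by_macro_region(survey_to_region, region_to_split):
    """Assign each survey to the split of its macro-region.

    Because the split is decided per macro-region, two surveys from the
    same macro-region can never land in different subsets.
    """
    splits = defaultdict(list)
    for survey, region in survey_to_region.items():
        splits[region_to_split[region]].append(survey)
    return dict(splits)


# Hypothetical surveys and macro-regions (not taken from the dataset).
survey_to_region = {
    "survey_a": "region_1",
    "survey_b": "region_1",
    "survey_c": "region_2",
    "survey_d": "region_3",
}
region_to_split = {"region_1": "train", "region_2": "val", "region_3": "test"}

splits = split_by_macro_region(survey_to_region, region_to_split)

# Leakage check: collect the macro-regions seen in each split and
# verify that no macro-region appears in more than one of them.
regions_per_split = {
    name: {survey_to_region[s] for s in surveys}
    for name, surveys in splits.items()
}
all_regions = [r for regions in regions_per_split.values() for r in regions]
no_leakage = len(all_regions) == len(set(all_regions))
```

Deciding the split at the macro-region level, rather than per section or per survey, is what keeps geographically adjacent data out of the evaluation subsets.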

Paper Structure (43 sections, 1 equation, 6 figures)

Figures (6)

  • Figure 1: Example of using the NAMSS search tool in the Bering Sea and Gulf of Alaska region. On the left are highlighted examples of search criteria by data type and the filter button, which applies the filter criteria only to data within the visible region. The search result is illustrated on the map as pink lines, representing the seismic lines of each survey. On the right, a link to download a CSV file with information about the surveys displayed on the map is highlighted. Source: https://walrus.wr.usgs.gov/namss/search.
  • Figure 2: Volume of available migrated data in 143 surveys, ordered by volume. The vertical dashed line marks the point at which the cumulative volume reaches 50 % of the total. The horizontal dashed line shows the 300 MB threshold used to cap the amount of data collected from each survey.
  • Figure 3: Geographic distribution of the 122 survey areas included in the Unicamp-NAMSS dataset. Each bounding box corresponds to the coverage of a seismic survey. The surveys are divided into Training (blue), Validation (green), and Test (red) subsets, arranged into nine macro-regions with no spatial overlap. The figure also reports, for each macro-region, the number of surveys and the total amount of data it contributes to the dataset.
  • Figure 4: Summary of dataset characteristics for the Training, Validation, and Test subsets. Graphs (a) and (b) describe the 122 surveys, while graphs (c) and (d) summarize the 2 588 seismic sections. (a) Distribution of survey acquisition years, with 90 % conducted between 1975 and 1985. (b) Average trace spacing (dx) per survey; although all data share a fixed temporal sampling interval of 4 ms, lateral sampling (distance between traces, dx) varies widely, with 12.5 m, 25 m, 33 m (110 ft), and 50 m accounting for 70 % of the data; 14 surveys lack this information. (c) Number of samples per trace, with 89 % of data between 1 000 and 2 000 samples, i.e., 4 s to 8 s of recording. (d) Number of traces per section, with 80 % between 400 and 3 200 traces.
  • Figure 5: UMAP projections of NAMSS embeddings extracted using two pretrained models. Top row: ResNet-50 pretrained on COCO. Bottom row: DINOv2 ViT-B/14. Left column: samples colored by dataset split. Right column: samples colored by acquisition macro-region. Training data span a wide region of embedding space, while validation and test samples appear more concentrated. No clear clustering by acquisition macro-region is observed.
  • ...and 1 more figure
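
Two quantities in the captions above lend themselves to a short worked sketch: the 300 MB per-survey cap (Figure 2) and the conversion from samples per trace to recording time at 4 ms sampling (Figure 4c). The code below is an illustration under stated assumptions, not the authors' preprocessing. The helper names, the file names, the greedy skip policy, and the reading of 1 MB as 1024 × 1024 bytes are all assumptions.

```python
MB = 1024 * 1024       # assumption: the captions do not define the MB unit
CAP_BYTES = 300 * MB   # 300 MB per-survey threshold from Figure 2
SAMPLE_INTERVAL_MS = 4  # fixed temporal sampling interval from Figure 4


def cap_survey(files, cap_bytes=CAP_BYTES):
    """Greedily keep files while the running total stays within the cap.

    `files` is a list of (name, size_in_bytes) pairs. A file that would
    push the total over the cap is skipped (an assumed policy; the
    source does not say how files are chosen within a survey).
    """
    kept, total = [], 0
    for name, size in files:
        if total + size > cap_bytes:
            continue
        kept.append(name)
        total += size
    return kept, total


def record_length_seconds(n_samples, dt_ms=SAMPLE_INTERVAL_MS):
    """Convert a number of samples per trace to recording time in seconds."""
    return n_samples * dt_ms / 1000


# Hypothetical survey with three files of 200, 150, and 50 MB.
files = [("line_01", 200 * MB), ("line_02", 150 * MB), ("line_03", 50 * MB)]
kept, total = cap_survey(files)

short_trace = record_length_seconds(1000)  # 4.0 s
long_trace = record_length_seconds(2000)   # 8.0 s
```

With these inputs the sketch keeps `line_01` and `line_03` for a total of 250 MB. The two conversions reproduce the 4 s to 8 s range quoted in the Figure 4 caption.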