Training a generalizable diffusion model for seismic data processing using a large-scale open-source waveform dataset

Xinyue Gong; Sergey Fomel; Yangkang Chen

Abstract

We introduce the Seismic Waveforms dataset for Automatic Neural-network processing (SWAN), a comprehensive and standardized benchmark designed to advance data-driven seismic signal processing. SWAN aggregates diverse synthetic and real seismic waveforms spanning a wide range of geological structures, noise conditions, propagation environments, and acquisition geometries, providing a unified foundation for training highly generalizable models. Leveraging this dataset, we develop and evaluate a conditionally constrained residual diffusion model for core seismic processing tasks, focusing on missing-trace reconstruction. Extensive experiments demonstrate that diffusion models trained on SWAN achieve state-of-the-art performance across heterogeneous testing scenarios, outperforming leading deep-learning and physics-based baselines on both synthetic benchmarks and field data examples. The results highlight SWAN's value as both a scalable training corpus and a rigorous evaluation framework, and illustrate the strong potential of diffusion-based architectures for robust, generalizable seismic data processing.

Paper Structure (25 sections, 9 equations, 12 figures, 2 tables)

Introduction
SWAN Dataset
Data Processing Pipeline
Dataset Composition
Methodology
Diffusion Models
Residual-Guided Diffusion Model (RGDM)
Forward Process
Reverse Process
Training Objective
Numerical Experiments
Example 1: Synthetic Hyperbolic Data
Example 2: Synthetic Edge-Structure Data
Example 3: 3D Synthetic Hyperbolic Volume
Example 4: Synthetic DAS Data
...and 10 more sections

Figures (12)

Figure 1: Overview of the SWAN data processing pipeline, including synthetic and real data sources, patch extraction, normalization, quality filtering, and metadata generation.
Figure 2: Representative $128\times128$ patches sampled from the four SWAN categories. Each group of three rows corresponds to one data type and is outlined using a distinct border color: real poststack (red, rows 1--3), real prestack (teal, rows 4--6), synthetic poststack (blue, rows 7--9), and synthetic prestack (green, rows 10--12).
Figure 3: Residual-guided diffusion framework used for seismic reconstruction. The training stage (top) learns residual increments, while the sampling stage (bottom) applies deterministic reverse diffusion conditioned on the observed waveform.
Figure 4: Example 1. Interpolation of a synthetic hyperbolic gather with 50% irregular sampling. (a) Complete data. (b)--(e) Reconstruction results of POCS, DRR, PySeisTr, and the proposed method. (f) Observed gather with 50% missing traces. (g)--(j) Corresponding residual panels.
Figure 5: Example 2. Interpolation of a synthetic edge-structure gather with 50% irregular sampling. (a) Complete data. (b)--(e) Reconstruction results of POCS, DRR, PySeisTr, and the proposed method. (f) Observed gather with 50% missing traces. (g)--(j) Corresponding residual panels.
...and 7 more figures
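
The captions for Figures 4 and 5 describe test gathers with 50% irregular sampling, that is, half of the traces removed at random positions. A minimal sketch of how such an observation could be simulated on a $128\times128$ patch is given below. It is illustrative only and is not taken from the authors' code: the function name `mask_traces`, the (time, trace) axis ordering, and the use of zero-filling for missing traces are all assumptions.

```python
import numpy as np


def mask_traces(gather, missing_ratio=0.5, seed=0):
    """Zero out a random subset of traces (columns) of a 2D gather.

    Hypothetical helper, not part of SWAN or the paper's code.
    Assumes `gather` has shape (n_time, n_traces).
    Returns the decimated gather and a boolean mask (True = trace kept).
    """
    rng = np.random.default_rng(seed)
    n_traces = gather.shape[1]
    n_missing = int(round(missing_ratio * n_traces))

    # Draw distinct trace indices so exactly n_missing traces are removed.
    missing = rng.choice(n_traces, size=n_missing, replace=False)
    keep = np.ones(n_traces, dtype=bool)
    keep[missing] = False

    # Broadcast the per-trace mask over the time axis (zero-fill assumption).
    observed = gather * keep[np.newaxis, :]
    return observed, keep


# Toy stand-in for a 128x128 patch (random values, not real seismic data).
patch = np.random.default_rng(1).standard_normal((128, 128))
observed, keep = mask_traces(patch, missing_ratio=0.5)
```

In a reconstruction setting, `observed` would play the role of the conditioning input and `patch` the target; the mask `keep` can also be used to re-insert the known traces after reconstruction.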