Training a generalizable diffusion model for seismic data processing using a large-scale open-source waveform dataset

Xinyue Gong, Sergey Fomel, Yangkang Chen

Abstract

We introduce the Seismic Waveforms dataset for Automatic Neural-network processing (SWAN), a comprehensive and standardized benchmark designed to advance data-driven seismic signal processing. SWAN aggregates diverse synthetic and real seismic waveforms spanning a wide range of geological structures, noise conditions, propagation environments, and acquisition geometries, providing a unified foundation for training highly generalizable models. Leveraging this dataset, we develop and evaluate a conditionally constrained residual diffusion model for core seismic processing tasks, focusing on missing-trace reconstruction. Extensive experiments demonstrate that diffusion models trained on SWAN achieve state-of-the-art performance across heterogeneous testing scenarios, outperforming leading deep-learning and physics-based baselines on both synthetic benchmarks and field data examples. The results highlight SWAN's value as both a scalable training corpus and a rigorous evaluation framework, and illustrate the strong potential of diffusion-based architectures for robust, generalizable seismic data processing.

Paper Structure (25 sections, 9 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Overview of the SWAN data processing pipeline, including synthetic and real data sources, patch extraction, normalization, quality filtering, and metadata generation.
  • Figure 2: Representative $128\times128$ patches sampled from the four SWAN categories. Each group of three rows corresponds to one data type and is outlined using a distinct border color: real poststack (red, rows 1–3), real prestack (teal, rows 4–6), synthetic poststack (blue, rows 7–9), and synthetic prestack (green, rows 10–12).
  • Figure 3: Residual-guided diffusion framework used for seismic reconstruction. The training stage (top) learns residual increments, while the sampling stage (bottom) applies deterministic reverse diffusion conditioned on the observed waveform.
  • Figure 4: Example 1. Interpolation of a synthetic hyperbolic gather with 50% irregular sampling. (a) Complete data. (b)–(e) Reconstruction results of POCS, DRR, PySeisTr, and the proposed method. (f) Observed gather with 50% missing traces. (g)–(j) Corresponding residual panels.
  • Figure 5: Example 2. Interpolation of a synthetic edge-structure gather with 50% irregular sampling. (a) Complete data. (b)–(e) Reconstruction results of POCS, DRR, PySeisTr, and the proposed method. (f) Observed gather with 50% missing traces. (g)–(j) Corresponding residual panels.
  • ...and 7 more figures
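
The captions above mention $128\times128$ patch extraction, normalization, and test gathers with 50% of traces missing at irregular positions. The following is a minimal NumPy sketch of how such a training pair could be prepared. It is not the authors' pipeline: the function names, the non-overlapping stride, the maximum-absolute-amplitude normalization, the uniform random choice of missing traces, and the layout (rows are time samples, columns are traces) are all assumptions made for illustration, and the quality-filtering step from Figure 1 is omitted.

```python
import numpy as np

PATCH = 128  # patch size quoted in the Figure 2 caption


def extract_patches(section, patch=PATCH, stride=PATCH):
    """Cut a 2-D section (time samples x traces) into patch x patch tiles.

    Hypothetical helper: a non-overlapping stride is assumed.
    """
    n_t, n_x = section.shape
    tiles = [
        section[i:i + patch, j:j + patch]
        for i in range(0, n_t - patch + 1, stride)
        for j in range(0, n_x - patch + 1, stride)
    ]
    return np.stack(tiles) if tiles else np.empty((0, patch, patch))


def normalize(patch_2d, eps=1e-12):
    """Scale a patch by its maximum absolute amplitude (assumed scheme)."""
    return patch_2d / (np.max(np.abs(patch_2d)) + eps)


def decimate_traces(patch_2d, missing_ratio=0.5, seed=0):
    """Zero a random subset of traces (columns).

    Returns the observed patch and a boolean mask that is True where a
    trace was kept. Uniform random selection is an assumption.
    """
    rng = np.random.default_rng(seed)
    n_x = patch_2d.shape[1]
    n_missing = int(round(missing_ratio * n_x))
    dead = rng.choice(n_x, size=n_missing, replace=False)
    mask = np.ones(n_x, dtype=bool)
    mask[dead] = False
    return patch_2d * mask[np.newaxis, :], mask


# Toy stand-in for a seismic section: random noise, 256 samples x 384 traces.
toy_rng = np.random.default_rng(42)
section = toy_rng.standard_normal((256, 384))

patches = extract_patches(section)   # stack of 128 x 128 tiles
clean = normalize(patches[0])        # reconstruction target
observed, mask = decimate_traces(clean, missing_ratio=0.5)
```

Here `observed` plays the role of the conditioning input and `clean` the reconstruction target; the difference `clean - observed` is nonzero only on the removed traces, which is the part an interpolation method has to recover.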