A Global-scale Database of Seismic Phases from Cloud-based Picking at Petabyte Scale

Yiyu Ni, Marine A. Denolle, Amanda M. Thomas, Alex Hamilton, Jannes Münchmeyer, Yinzhi Wang, Loïc Bachelot, Chad Trabant, David Mencin

TL;DR

We address global-scale seismic phase picking by building a public database of 4.3 billion P- and S-wave picks extracted from 1.3 PB of continuous waveform data using a cloud-native workflow on AWS. The pipeline processes ~145,000 containerized jobs across 47,354 stations from 2002–2025, leveraging PhaseNet via SeisBench for arrivals and storing results in AWS DocumentDB. The results reveal Omori-law decay signatures in pick counts, seasonal noise effects, and densely sampled regional coverage, illustrating the database's value for earthquake catalogs and ML datasets. The work demonstrates petabyte-scale seismic data mining in the cloud and provides automated, open-access data products for the community.

Abstract

We present the first global-scale database of 4.3 billion P- and S-wave picks extracted from 1.3 PB of continuous seismic data via a cloud-native workflow. Using cloud computing services on Amazon Web Services, we launched ~145,000 containerized jobs on continuous records from 47,354 stations spanning 2002-2025, completing in under three days. Phase arrivals were identified with a deep learning model, PhaseNet, through SeisBench, an open-source Python ecosystem for deep learning. To give a global overview of these picks, we present preliminary results: pick time series reveal Omori-law aftershock decay and seasonal variations linked to noise levels, and dense regional coverage will enhance earthquake catalogs and machine-learning datasets. All picks are available in a publicly queryable database, a powerful resource for researchers studying seismicity around the world. This report describes the database and the underlying workflow, demonstrating the feasibility of petabyte-scale seismic data mining on the cloud and of delivering intelligent data products to the community in an automated manner.
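To put the headline numbers in perspective, the short calculation below derives a few averages from the figures quoted in the abstract. It is a back-of-envelope sketch only: it assumes decimal units (1 PB = 10^15 bytes), treats "~145,000 jobs" and "under three days" as exact, and pools all archives into a single run, none of which the abstract states. The variable names are illustrative.

```python
# Back-of-envelope averages from the figures quoted in the abstract.
# Assumptions (not from the paper): decimal units, exact job count,
# and a single pooled run lasting the full three days.
PB = 1e15  # bytes per petabyte (decimal convention, assumed)
GB = 1e9   # bytes per gigabyte (decimal convention, assumed)

total_bytes = 1.3 * PB          # continuous waveform data processed
n_jobs = 145_000                # containerized jobs (approximate in the source)
n_stations = 47_354             # stations with continuous records
n_picks = 4.3e9                 # P- and S-wave picks in the database
max_runtime_s = 3 * 24 * 3600   # "under three days", used as an upper bound

# Average input volume handled by one job.
gb_per_job = total_bytes / n_jobs / GB

# Finishing in under three days implies at least this aggregate read rate.
min_throughput_gb_s = total_bytes / max_runtime_s / GB

# Average yield per station and per gigabyte of waveform data.
picks_per_station = n_picks / n_stations
picks_per_gb = n_picks / (total_bytes / GB)

print(f"average data per job:        {gb_per_job:.1f} GB")
print(f"minimum aggregate read rate: {min_throughput_gb_s:.1f} GB/s")
print(f"average picks per station:   {picks_per_station:,.0f}")
print(f"average picks per GB:        {picks_per_gb:,.0f}")
```

Under these assumptions each job handles roughly 9 GB on average, and the whole run needs a sustained aggregate read rate of at least about 5 GB/s. These are averages; actual per-job volumes and per-station pick counts will vary widely.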

Paper Structure

This paper contains 10 sections, 5 figures.

Figures (5)

  • Figure 1: Map of stations, displayed as triangles and color-coded according to the data availability, in station-years, included in the EarthScope, NCEDC, and SCEDC archives. Panel (a) shows the distribution of global station data availability. Panel (b) shows a detailed view of stations in the United States, while Panel (c) shows data availability in California.
  • Figure 2: The scalable cloud-native workflow for seismic phase picking. Containerized jobs are submitted to AWS Batch, which loads miniSEED seismic waveforms directly from AWS S3 buckets. Phase arrivals were identified with PhaseNet through the SeisBench implementation. A DocumentDB cluster is employed to store job metadata, picks, and checkpoints. Finally, an EC2 instance is used to provide a public database query service.
  • Figure 3: Detailed job run history for the EarthScope, NCEDC, and SCEDC datasets. Panel (a) shows the Job ID as a function of time, color-coded by the year the data was recorded, for the EarthScope dataset. Panel (b) shows the job progression for the NCEDC and SCEDC datasets. Panels (c) and (d) show the pending jobs as a function of time. Panels (e) and (f) show the running jobs as a function of time. The horizontal dashed line represents the job quota.
  • Figure 4: Daily picks for selected stations. Stations are indicated by triangles on the central map, annotated with location and channel codes. For each of the ten example stations, the detail plots show a time series of the number of picks per day and a 28-day moving average in red.
  • Figure 5: Omori-type decays of the number of picks per day (blue) and the number of events per day (red). We use picks at the reference station given in the figure title. Event counts for the Illapel earthquake are from the Chilean Seismic Network (CSN) catalog, and for the other two examples from the International Seismological Centre (ISC) and United States Geological Survey (USGS) catalogs. In each case, we count all events within 1.5 degrees of the reference station in both latitude and longitude.
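
The quantities in the captions of Figures 4 and 5 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the 28-day average is taken as a trailing window (the caption does not say whether it is trailing or centered), the longitude wrap-around is an added convenience, and the decay is written in the standard Omori-Utsu form n(t) = K / (t + c)^p, since the captions do not give the parametrization used.

```python
import numpy as np


def moving_average(daily_counts, window=28):
    """Moving average over full `window`-day spans (alignment assumed).

    Returns an array of length len(daily_counts) - window + 1, or an
    empty array if the series is shorter than the window.
    """
    counts = np.asarray(daily_counts, dtype=float)
    if counts.size < window:
        return np.empty(0)
    kernel = np.ones(window) / window
    return np.convolve(counts, kernel, mode="valid")


def in_box(ev_lat, ev_lon, sta_lat, sta_lon, half_width_deg=1.5):
    """Boolean mask of events within `half_width_deg` of the station
    in both latitude and longitude (the selection rule of Figure 5).

    The longitude difference is wrapped into [-180, 180) so that the
    test also works across the antimeridian (an added assumption).
    """
    ev_lat = np.asarray(ev_lat, dtype=float)
    ev_lon = np.asarray(ev_lon, dtype=float)
    dlat = np.abs(ev_lat - sta_lat)
    dlon = np.abs((ev_lon - sta_lon + 180.0) % 360.0 - 180.0)
    return (dlat <= half_width_deg) & (dlon <= half_width_deg)


def omori_utsu_rate(t_days, K, c, p):
    """Omori-Utsu aftershock rate n(t) = K / (t + c)**p.

    t_days is the time since the mainshock in days; K, c, and p are
    free parameters that would be fitted to the observed counts.
    """
    t = np.asarray(t_days, dtype=float)
    return K / (t + c) ** p


if __name__ == "__main__":
    # Synthetic daily counts: a decaying sequence on a constant background.
    days = np.arange(120)
    counts = 50.0 + omori_utsu_rate(days, K=2000.0, c=1.0, p=1.1)
    smooth = moving_average(counts, window=28)
    print(smooth.shape)

    # Two made-up events near a made-up station: the first is kept,
    # the second is more than 1.5 degrees away in latitude.
    mask = in_box([-31.0, -35.0], [-71.0, -71.0], sta_lat=-31.5, sta_lon=-71.5)
    print(mask)
```

The box test mirrors the caption's rule of a maximum difference in latitude and in longitude, which is a rectangle in degrees rather than a fixed distance in kilometers, so its east-west extent shrinks toward the poles.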