A Global-scale Database of Seismic Phases from Cloud-based Picking at Petabyte Scale
Yiyu Ni, Marine A. Denolle, Amanda M. Thomas, Alex Hamilton, Jannes Münchmeyer, Yinzhi Wang, Loïc Bachelot, Chad Trabant, David Mencin
TL;DR
We address global-scale seismic phase picking by building a public database of 4.3 billion P- and S-wave picks extracted from 1.3 PB of continuous waveform data using a cloud-native workflow on AWS. The pipeline runs ~145,000 containerized jobs on records from 47,354 stations spanning 2002–2025, using PhaseNet via SeisBench to pick arrivals and storing the results in AWS DocumentDB. Pick counts show Omori-law decay signatures and seasonal noise effects, and the dense regional coverage illustrates the database's value for earthquake catalogs and ML datasets. The work demonstrates petabyte-scale seismic data mining in the cloud and provides automated, open-access data products for the community.
Abstract
We present the first global-scale database of 4.3 billion P- and S-wave picks, extracted from 1.3 PB of continuous seismic data via a cloud-native workflow. Using cloud computing services on Amazon Web Services, we launched ~145,000 containerized jobs on continuous records from 47,354 stations spanning 2002–2025; the jobs completed in under three days. Phase arrivals were identified with the deep learning model PhaseNet, run through SeisBench, an open-source Python ecosystem for deep learning. To give a global overview of these picks, we present preliminary results: pick time series reveal Omori-law aftershock decay and seasonal variations linked to noise levels, and the dense regional coverage will enhance earthquake catalogs and machine-learning datasets. All picks are available in a publicly queryable database, a powerful resource for researchers studying seismicity around the world. This report describes the database and the underlying workflow, demonstrating that petabyte-scale seismic data mining on the cloud is feasible and that intelligent data products can be delivered to the community in an automated manner.
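To put the headline figures in perspective, the short sketch below derives a few back-of-envelope averages from the numbers quoted in the abstract. It is illustrative only: it assumes decimal units (1 PB = 10^15 bytes), treats "under three days" as a runtime of exactly three days (so the throughput is a lower bound), and spreads data and picks uniformly over jobs and stations, which the real workload almost certainly does not do. The function and key names are hypothetical.

```python
# Back-of-envelope scale estimates from the figures quoted in the abstract.
# Assumptions (not from the source): decimal units (1 PB = 1e15 bytes),
# a runtime of exactly three days used as an upper bound, uniform averages.

TOTAL_BYTES = 1.3e15            # 1.3 PB of continuous waveform data
N_JOBS = 145_000                # approximate number of containerized jobs
N_STATIONS = 47_354             # stations processed
N_PICKS = 4.3e9                 # P- and S-wave picks in the database
MAX_RUNTIME_S = 3 * 24 * 3600   # "under three days", taken as three days


def scale_estimates():
    """Return average per-job, per-station, and throughput figures."""
    total_gb = TOTAL_BYTES / 1e9
    return {
        "gb_per_job": total_gb / N_JOBS,
        "min_throughput_gb_per_s": total_gb / MAX_RUNTIME_S,
        "picks_per_station": N_PICKS / N_STATIONS,
        "picks_per_gb": N_PICKS / total_gb,
    }


if __name__ == "__main__":
    for name, value in scale_estimates().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions, each job handles roughly 9 GB of waveform data, the workflow sustains at least about 5 GB/s in aggregate, and each station contributes on the order of 90,000 picks on average.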
