PyAWD: A Library for Generating Large Synthetic Datasets of Acoustic Wave Propagation
Pascal Tribel, Gianluca Bontempi
TL;DR
PyAWD addresses data sparsity in seismic ML with a Python library that generates large, high-resolution synthetic datasets of spatio-temporal acoustic wave propagation in 2D and 3D heterogeneous media. It solves the anisotropic nondispersive Acoustic Wave Equation $\frac{d^2u}{dt^2} = c\nabla^2 u - \alpha \frac{du}{dt} + f$ with Devito, where $u$ is the wave field, $c$ the wave speed field, $\alpha$ the damping coefficient, and $f$ the external force. The resulting datasets are PyTorch-compatible, support on-the-fly generation, and expose interrogator probes for use in ML pipelines. The authors demonstrate the library on a 2D epicenter retrieval task in a Marmousi field, evaluating several ML models and running data-budgeting analyses that reveal data requirements and model robustness; TCNN and Extra Trees emerge as the top performers. Overall, PyAWD offers a practical way to generate rich, ML-ready seismic data and to explore model selection, data budgeting, and transfer learning. Future work will integrate real data and extend the library to more complex wave equations and source models.
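To make the equation above concrete, here is a minimal NumPy sketch of an explicit finite-difference integration of the same damped wave equation on a 2D grid, with the field recorded at a few probe positions. It is not PyAWD's or Devito's implementation: the function name `simulate_damped_wave`, its parameters, the Gaussian source pulse, and the zero-valued boundaries are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_damped_wave(c, alpha, source, probes, nt=200, dx=1.0, dt=0.1, amplitude=1.0):
    """Integrate u_tt = c * lap(u) - alpha * u_t + f with central differences.

    Hypothetical helper, not part of PyAWD. `c` is a 2D array multiplying the
    Laplacian as in the equation above, so stability roughly requires
    sqrt(c.max()) * dt / dx <= 1 / sqrt(2).
    """
    c = np.asarray(c, dtype=float)
    u_prev = np.zeros_like(c)  # field at time step n - 1
    u_curr = np.zeros_like(c)  # field at time step n
    rows = [p[0] for p in probes]
    cols = [p[1] for p in probes]
    traces = np.zeros((nt, len(probes)))
    damp = 0.5 * alpha * dt  # from the centered difference of the damping term

    for n in range(nt):
        # 5-point Laplacian on interior points
        lap = np.zeros_like(u_curr)
        lap[1:-1, 1:-1] = (
            u_curr[2:, 1:-1] + u_curr[:-2, 1:-1]
            + u_curr[1:-1, 2:] + u_curr[1:-1, :-2]
            - 4.0 * u_curr[1:-1, 1:-1]
        ) / dx**2

        # Assumed source: a Gaussian pulse in time injected at a single grid point
        f = np.zeros_like(u_curr)
        f[source] = amplitude * np.exp(-(((n * dt) - 2.0) / 0.5) ** 2)

        # Leapfrog update with the damping term treated implicitly
        u_next = (2.0 * u_curr - (1.0 - damp) * u_prev + dt**2 * (c * lap + f)) / (1.0 + damp)

        # Keep the field at zero on the grid boundary (assumed boundary condition)
        u_next[0, :] = u_next[-1, :] = 0.0
        u_next[:, 0] = u_next[:, -1] = 0.0

        u_prev, u_curr = u_curr, u_next
        traces[n] = u_curr[rows, cols]  # record the field at each probe

    return traces


# Usage: homogeneous medium, one probe close to the source and one far from it
c = np.ones((64, 64))
traces = simulate_damped_wave(c, alpha=0.1, source=(32, 32), probes=[(32, 37), (10, 10)])
print(traces.shape)
```

Each column of `traces` is a time series at one probe, which is the kind of signal an epicenter retrieval model would take as input. A production solver such as Devito would add absorbing boundaries and higher-order stencils, which this sketch leaves out.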
Abstract
Seismic data is often sparse and unevenly distributed because deploying physical seismometers is costly and logistically difficult, which limits the application of Machine Learning (ML) to earthquake analysis. While simulation methods exist, no tool allows the generation of large datasets containing simulated measurements of the ground motion. To address this gap, we introduce PyAWD, a Python library designed to generate high-resolution synthetic datasets simulating spatio-temporal acoustic wave propagation in both two-dimensional and three-dimensional heterogeneous media. By allowing fine control over parameters such as the wave speed, external forces, spatial and temporal discretization, and media composition, PyAWD enables the creation of ML-scale datasets that capture the complexity of seismic wave behavior. We illustrate the library's potential with an epicenter retrieval task, showing that it is suited to designing complex, accurate seismic problems that require advanced ML approaches when dense real-world data are scarce or unavailable. We also show how the tool can be used to address data budgeting in the context of epicenter retrieval.
