Auto-Tuning for OpenMP Dynamic Scheduling applied to Full Waveform Inversion

Felipe H. S. da Silva, João B. Fernandes, Idalmis M. Sardina, Tiago Barros, Samuel Xavier-de-Souza, Italo A. S. Assis

TL;DR

FWI is a compute-heavy seismic inversion problem; the paper proposes auto-tuning the OpenMP dynamic chunk size using PATSMA with Coupled Simulated Annealing to adapt workload distribution. It defines the auto-tuning cost as the runtime of the first time step of the first seismic shot in the first FWI iteration, and applies the resulting chunk size across the forward, reverse, and checkpointing loops. Across six computing environments and multiple problem sizes, the approach reduces FWI runtime by up to 70.46% compared to standard OpenMP schedulers, with an auto-tuning overhead below 1.2%, and the improvements remain stable as the problem scales. The work contributes a reproducible, code-available strategy for dynamic scheduler tuning in seismic inversion, enabling better utilization of shared-memory HPC systems.

Abstract

Full Waveform Inversion (FWI) is a widely used method in seismic data processing, capable of estimating models that represent the characteristics of the geological layers of the subsurface. Because it works with a massive amount of data, executing this method requires considerable time and computational resources. Techniques such as FWI adapt well to parallel computing and can be parallelized on shared-memory systems using the OpenMP application programming interface (API). Parallel tasks can be managed through the loop schedulers provided by OpenMP. The dynamic scheduler stands out for distributing predefined fixed-size chunks of loop iterations to idle processing cores at runtime. It can better adapt to FWI, where data processing can be irregular. However, the relationship between the chunk size and the runtime is unknown. Optimization techniques can employ meta-heuristics to explore the parameter search space, avoiding testing all possible solutions. Here, we propose a strategy to use the Parameter Auto-Tuning for Shared Memory Algorithms (PATSMA), with Coupled Simulated Annealing (CSA) as its optimization method, to automatically adjust the chunk size for the dynamic scheduling of wave propagation, one of the most expensive steps in FWI. Since testing each candidate chunk size in the complete FWI is impractical, our approach consists of running PATSMA with an objective function defined as the runtime of the first time iteration of the first seismic shot of the first FWI iteration. The resulting chunk size is then employed in all wave propagations involved in an FWI. We conducted tests to measure the runtime of an FWI using the proposed auto-tuning, varying the problem size and running on different computational environments. The results show that applying the proposed auto-tuning in an FWI reduces its runtime by up to 70.46% compared to standard OpenMP schedulers.
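
The idea in the abstract can be illustrated with a small, self-contained sketch. It is not the PATSMA API and not the authors' code: all names (`dynamic_makespan`, `tune_chunk`, `first_step_cost`) are hypothetical, the per-iteration costs are synthetic, the "runtime of the first time step" is replaced by a simulated makespan of a dynamically scheduled loop, and a plain single-chain simulated annealing stands in for Coupled Simulated Annealing.

```python
import heapq
import math
import random


def dynamic_makespan(costs, n_threads, chunk, dispatch_overhead=0.5):
    """Simulate a dynamically scheduled loop: whichever thread becomes
    idle first takes the next chunk of `chunk` consecutive iterations.
    `dispatch_overhead` is an assumed fixed cost paid per chunk."""
    finish = [0.0] * n_threads          # time at which each thread is free
    heapq.heapify(finish)
    for start in range(0, len(costs), chunk):
        free_at = heapq.heappop(finish)  # earliest idle thread
        work = sum(costs[start:start + chunk])
        heapq.heappush(finish, free_at + dispatch_overhead + work)
    return max(finish)                   # loop ends when the last thread ends


def tune_chunk(cost_fn, lo, hi, iters=200, t0=1.0, seed=0):
    """Single-chain simulated annealing over integer chunk sizes in [lo, hi].
    This is a simplified stand-in for CSA, not the method used by PATSMA."""
    rng = random.Random(seed)
    current = rng.randint(lo, hi)
    current_cost = cost_fn(current)
    best, best_cost = current, current_cost
    for k in range(iters):
        temp = t0 / (1 + k)                                # cooling schedule
        step = max(1, round((hi - lo) * temp / (2 * t0)))  # shrinking moves
        cand = min(hi, max(lo, current + rng.randint(-step, step)))
        cand_cost = cost_fn(cand)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with a
        # probability that decreases as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, cand_cost
        if cand_cost < best_cost:
            best, best_cost = cand, cand_cost
    return best, best_cost


# Synthetic, irregular per-iteration costs for one "first time step".
_rng = random.Random(42)
first_step_costs = [_rng.uniform(0.5, 2.0) for _ in range(4000)]


def first_step_cost(chunk):
    # Objective: simulated runtime of the first time step only (8 threads).
    return dynamic_makespan(first_step_costs, 8, chunk)


best_chunk, best_time = tune_chunk(first_step_cost, 1, 256)
print("tuned chunk:", best_chunk, "simulated first-step time:", best_time)
```

In a real FWI code, the tuned value would then be reused as the chunk size of the dynamically scheduled wave-propagation loops for every remaining time step, shot, and iteration, so the tuning cost is paid only once.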

Paper Structure

This paper contains 13 sections, 3 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: Velocity model with dimensions $n1 = n2 = n3 = 400$.
  • Figure 2: Runtime per shot for FWI using the proposed auto-tuning compared to the OpenMP static and guided schedulers on the NPAD machine, with $16$ shots. For the OpenMP schedulers, the chunk size was not explicitly set. The velocity model size for these tests was $(n1,n2,n3)=(200,400,400)$.
  • Figure 3: Speedup for FWI using the proposed auto-tuning compared to the OpenMP static and guided schedulers on the NPAD machine, with $1$, $2$, $4$, $8$, $16$, $32$, $64$ and $128$ shots. For the OpenMP schedulers, the chunk size was not explicitly set. The velocity model size for these tests was $(n1,n2,n3)=(200,400,400)$. Each point is a median of five executions.
  • Figure 4: Single-shot FWI speedup using the proposed auto-tuning compared to OpenMP (a) static and (b) guided schedulers on $6$ machines (OPT3, SD, NPAD, STD3, STDE4, and DENSE), for three input sizes, $(n1,n2,n3)=(100,400,400)$, $(200,400,400)$, and $(400,400,400)$. For OpenMP schedulers, the chunk size was not explicitly set. Each point is a median of at least five runs.
  • Figure 5: Single-shot FWI runtime using the proposed auto-tuning compared to the default OpenMP static and guided schedulers, and the OpenMP dynamic, static, and guided using the chunk size proposed by auto4omp2022 (marked with *). This set of experiments was performed on the SD machine for three input sizes, $(n1,n2,n3) =$ (a) $(100,400,400)$, (b) $(200,400,400)$, and (c) $(400,400,400)$. Each point is a median of five runs.
  • ...and 1 more figure