Low-ordered Orthogonal Voxel Finite Element with INT8 Tensor Cores for GPU-based Explicit Elastic Wave Propagation Analysis

Tsuyoshi Ichimura, Kohei Fujita, Muneo Hori, Maddegedara Lalith

TL;DR

Fast, explicit elastic wave simulations in large 3D media are challenging due to computational cost and numerical dispersion. The paper proposes TCOVFEM, an OVFEM-based approach that leverages INT8 Tensor Cores by transforming the core operator into an INT8-friendly form and using a hierarchical FP64–INT64–INT8 expansion to preserve accuracy. The authors implement the method on an NVIDIA A100 GPU and demonstrate that ds=2 mm with TCOVFEM achieves FP64-equivalent accuracy while delivering substantial speedups, up to 17× relative to conventional VFEM and 4.5× for the matrix-vector kernel. The approach offers a practical route to accelerating large-scale explicit wave simulations on GPUs and may extend to multi-GPU settings and other linear isotropic elastic problems.

Abstract

Faster explicit elastic wavefield simulations are required for large and complex three-dimensional media using a structured finite element method. Such wavefield simulations are suitable for GPUs, which have exhibited improved computational performance in recent years, and the use of GPUs is expected to speed up such simulations. However, available computational performance on GPUs is typically not fully exploited, and the conventional method involves some numerical dispersion. Thus, in this paper, we propose an explicit structured-mesh wavefield simulation method that uses INT8 Tensor Cores and reduces numerical dispersion to speed up computation on GPUs. The proposed method was implemented for GPUs, and its performance was evaluated in a simulation experiment of a real-world problem. The results demonstrate that the proposed method is 17.0 times faster than the conventional method.

Paper Structure

This paper contains 4 sections, 21 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Definition of a voxel element. For a cube of size $ds$ in each direction, a local coordinate system $r_1r_2r_3$ is defined ($-1 \le r_1 \le 1$, $-1 \le r_2 \le 1$, $-1 \le r_3 \le 1$). A local node number from 1 to 8 is assigned to each node of the element. $\phi^{\beta}$ is the basis function for node $\beta$.
  • Figure 2: Computation of matrix-vector products using INT8 Tensor Cores with direct FP64-INT8 conversion and the proposed hierarchical FP64-INT64-INT8 conversion. Here, 32 elements are computed using 32 threads per thread block.
  • Figure 3: Decomposition of (24$\times$48)$\times$(48$\times$32) matrix-matrix product into nine (8$\times$16)$\times$(16$\times$32) matrix-matrix products. Here, the A fragment can be reused throughout the $M$ sets of computations. Note that the results in the 32-bit integer C fragment are flushed every $M=3$ stages to an INT64 buffer to avoid overflow.
  • Figure 4: Elimination of shared memory loads/stores by direct addition of results to global memory. Note that one out of the three C fragments is shown. 32 elements are computed using 32 threads per thread block.
  • Figure 5: Model used for the numerical experiment.
  • ...and 1 more figure
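
The hierarchical FP64-INT64-INT8 conversion named in the Figure 2 caption can be illustrated with a small pure-Python sketch. This is one plausible reading of the idea, not the authors' kernel. It assumes an operator matrix whose entries already fit in INT8, a fixed-point scaling of the FP64 vector by `2**frac_bits`, and a signed base-256 digit split of each fixed-point value. The function names and the parameters `frac_bits` and `n_digits` are hypothetical. Python integers are unbounded, so the assertions stand in for the INT8 and INT32 widths of the hardware.

```python
def split_int8_digits(q, n_digits):
    """Split integer q into signed base-256 digits, each in [-128, 127],
    so that q == sum(d * 256**k for k, d in enumerate(digits))."""
    digits = []
    for _ in range(n_digits):
        d = ((q + 128) % 256) - 128   # signed remainder in [-128, 127]
        digits.append(d)
        q = (q - d) // 256            # exact division, since q - d is a multiple of 256
    if q != 0:
        raise OverflowError("value does not fit in n_digits INT8 digits")
    return digits


def int8_matvec(A, x):
    """Integer matrix-vector product with INT8 inputs.
    The assertions model INT8 operands and an INT32 accumulator."""
    out = []
    for row in A:
        acc = 0
        for a, v in zip(row, x):
            assert -128 <= a <= 127 and -128 <= v <= 127
            acc += a * v
        assert -2**31 <= acc < 2**31
        out.append(acc)
    return out


def hierarchical_matvec(A, x_fp64, frac_bits=40, n_digits=8):
    """Approximate A @ x using only INT8 products (assumed scheme)."""
    # FP64 -> INT64-style fixed point (assumed scaling by 2**frac_bits).
    q = [round(v * 2**frac_bits) for v in x_fp64]
    # INT64 -> one INT8 vector per digit position.
    digit_vectors = list(zip(*(split_int8_digits(v, n_digits) for v in q)))
    # Accumulate the per-digit INT8 products with their base-256 weights.
    y = [0] * len(A)
    for k, dk in enumerate(digit_vectors):
        partial = int8_matvec(A, dk)
        y = [yi + p * 256**k for yi, p in zip(y, partial)]
    # Back to floating point.
    return [yi / 2**frac_bits for yi in y]
```

For inputs that are exactly representable in fixed point, for example `hierarchical_matvec([[1, -2, 3], [4, 5, -6]], [0.5, -1.25, 3.0])`, the sketch reproduces the FP64 product `[12.0, -22.25]`. For general inputs the only error in this sketch is the initial rounding to `frac_bits` fractional bits, because every integer step after that is exact.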