Parallel Algorithms for Successive Convolution

Andrew J. Christlieb, Pierson T. Guthrey, William A. Sands, Mathialakan Thavappiragasm

TL;DR

This work evaluates the performance of alternative discretizations for partial differential equations (PDEs). It develops a domain decomposition algorithm for distributed memory systems together with shared memory algorithms, and analyzes several implementations of these parallel algorithms that optimize the predominant loop structures and maximize data reuse.

Abstract

In this work, we consider alternative discretizations for PDEs that use expansions involving integral operators to approximate spatial derivatives. These constructions use explicit information within the integral terms, but treat boundary data implicitly, which contributes to the overall speed of the method. This approach is provably unconditionally stable for linear problems, and stability has been demonstrated experimentally for nonlinear problems. Additionally, it is matrix-free in the sense that it is not necessary to invert linear systems and iteration is not required for nonlinear terms. Moreover, the scheme employs a fast summation algorithm that yields a method with a computational complexity of $\mathcal{O}(N)$, where $N$ is the number of mesh points along a direction. While much work has been done to explore the theory behind these methods, their practicality in large-scale computing environments is a largely unexplored topic. In this work, we explore the performance of these methods by developing a domain decomposition algorithm suitable for distributed memory systems along with shared memory algorithms. As a first pass, we derive an artificial CFL condition that enforces a nearest-neighbor communication pattern and briefly discuss possible generalizations. We also analyze several approaches for implementing the parallel algorithms by optimizing predominant loop structures and maximizing data reuse. Using a hybrid design that employs MPI and Kokkos for the distributed and shared memory components of the algorithms, respectively, we show that our methods are efficient and can sustain an update rate $> 1\times10^8$ DOF/node/s. We provide results that demonstrate the scalability and versatility of our algorithms using several different PDE test problems, including a nonlinear example, which employs an adaptive time-stepping rule.
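
The $\mathcal{O}(N)$ fast summation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a left-sided convolution with an exponential kernel, $I[v](x) = \alpha \int_a^x e^{-\alpha (x-y)} v(y)\,dy$, on a uniform mesh, and it swaps in a simple trapezoid rule for the local integrals in place of the sixth-order quadrature used in the paper. The kernel form, the function names, and the parameter `alpha` are assumptions made for illustration and are not taken from the abstract.

```python
import math

def local_integrals(v, alpha, dx):
    """Local contributions J_i ~ alpha * int_{x_{i-1}}^{x_i} exp(-alpha (x_i - y)) v(y) dy.

    Assumption: trapezoid rule, used here only to keep the sketch short.
    """
    decay = math.exp(-alpha * dx)
    J = [0.0] * len(v)
    for i in range(1, len(v)):
        J[i] = 0.5 * alpha * dx * (decay * v[i - 1] + v[i])
    return J

def left_convolution_fast(v, alpha, dx):
    """O(N) recursion: I_i = exp(-alpha dx) * I_{i-1} + J_i, with I_0 = 0."""
    decay = math.exp(-alpha * dx)
    J = local_integrals(v, alpha, dx)
    I = [0.0] * len(v)
    for i in range(1, len(v)):
        I[i] = decay * I[i - 1] + J[i]
    return I

def left_convolution_direct(v, alpha, dx):
    """O(N^2) reference: I_i = sum_{k=1}^{i} exp(-alpha (i - k) dx) * J_k."""
    J = local_integrals(v, alpha, dx)
    return [
        sum(math.exp(-alpha * (i - k) * dx) * J[k] for k in range(1, i + 1))
        for i in range(len(v))
    ]
```

Both routines use the same local quadrature, so they agree up to rounding; the recursion simply reuses the previously accumulated value instead of re-summing every cell to the left of each mesh point.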

Paper Structure

This paper contains 28 sections, 89 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Stencils used to build the sixth-order quadrature [christlieb2019kernel, christlieb2020_NDAD].
  • Figure 2: A sixth-order WENO quadrature stencil in 2D.
  • Figure 3: Fast convolution communication stencil in 2D based on nearest neighbors (N-Ns).
  • Figure 4: Heterogeneous platform targeted by Kokkos [kokkos2014].
  • Figure 5: Performance comparison of different parallel execution policies for the pattern in \ref{3D_pattern_1}, using test cases in 2D (left) and 3D (right). Tests were conducted on a single 40-core node using the code configuration outlined in \ref{tab:compiler and opt flags}. Each group consists of three plots that differ only in the team size. Hyperthreading is not enabled on our systems, so Kokkos::AUTO() defaults to a team size of 1. In each pane, "best" refers to the best run for that configuration across the different team sizes. Tile experiments used block sizes of $256^2$ in 2D and $32^3$ in 3D. We observe that vectorized policies are generally faster than non-vectorized policies. Interestingly, among blocked/tiled policies, those that construct subviews appear to be faster than those that skip the subview construction, despite the additional work. As the problem size increases, the performance of blocked policies improves substantially. The weaker performance at small problem sizes can be attributed to the large number of idle thread teams when the problem size does not produce enough blocks. In such cases, increasing the team size does offer an improvement, as it reduces the number of idle thread teams. For non-blocked policies, increasing the team size generally results in minimal, if any, improvement in performance. In all cases, blocking provides a more consistent update rate once enough work is introduced.
  • ...and 7 more figures
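
The idle-team argument in the Figure 5 caption can be made concrete with a small back-of-the-envelope calculation. The sketch below uses the tile sizes ($256^2$ in 2D, $32^3$ in 3D) and the 40-core node from the caption. The mesh extents in the usage note and the rule that `cores // team_size` teams can run concurrently are assumptions for illustration, not measurements or statements from the paper.

```python
import math

def tile_count(extent, tile, dim):
    """Number of tiles needed to cover `extent` points per direction in `dim` dimensions."""
    return math.ceil(extent / tile) ** dim

def idle_teams(extent, tile, dim, cores=40, team_size=1):
    """Teams left without a tile, assuming `cores // team_size` teams run concurrently.

    Assumption: each team processes one tile at a time and tiles are the only
    unit of work handed to a team.
    """
    concurrent_teams = cores // team_size
    return max(concurrent_teams - tile_count(extent, tile, dim), 0)
```

For a hypothetical $1024^2$ mesh with $256^2$ tiles there are 16 tiles, so with a team size of 1 this model leaves 24 of the 40 teams idle, while a team size of 4 gives 10 teams and none idle. A hypothetical $64^3$ mesh with $32^3$ tiles yields only 8 tiles, which is consistent with the caption's remark that larger teams help when there are too few blocks.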