Parallel performance of shared memory parallel spectral deferred corrections

Philip Freese; Sebastian Götschel; Thibaut Lunet; Daniel Ruprecht; Martin Schreiber

TL;DR

This work describes OpenMP-based implementations of parallel Spectral Deferred Corrections (SDC) in two well-established simulation codes: the finite-volume-based operational ocean model ICON-O and the spherical-harmonics-based research code SWEET.

Abstract

We investigate the parallel performance of parallel spectral deferred corrections (SDC), a numerical approach that provides small-scale parallelism for the numerical solution of initial value problems. The scheme is applied to the shallow water equations and, to be efficient, uses an IMEX splitting that integrates fast modes implicitly and slow modes explicitly. We describe $\texttt{OpenMP}$-based implementations of parallel SDC in two well-established simulation codes: the finite-volume-based operational ocean model $\texttt{ICON-O}$ and the spherical-harmonics-based research code $\texttt{SWEET}$. The implementations are benchmarked on a single node of the JUSUF ($\texttt{SWEET}$) and JUWELS ($\texttt{ICON-O}$) systems at Jülich Supercomputing Centre. We demonstrate a reduction of time-to-solution across a range of accuracies. For $\texttt{ICON-O}$, we show speedup over the currently used Adams--Bashforth-2 integrator with $\texttt{OpenMP}$ loop parallelization. For $\texttt{SWEET}$, we show speedup over serial spectral deferred corrections and a second-order implicit-explicit integrator.
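The idea of an IMEX splitting can be illustrated on a scalar test problem. The following is a minimal sketch, not the scheme used in the paper: it applies a first-order IMEX Euler step (rather than SDC) to $u' = \lambda_\text{fast} u + \lambda_\text{slow} u$, treating the fast term implicitly and the slow term explicitly. The function names and the test values of $\lambda_\text{fast}$ and $\lambda_\text{slow}$ are assumptions chosen for illustration only.

```python
def imex_euler_step(u, dt, lam_fast, lam_slow):
    """One first-order IMEX Euler step for u' = lam_fast*u + lam_slow*u.

    The fast term is treated implicitly and the slow term explicitly:
        u_new = u + dt*lam_fast*u_new + dt*lam_slow*u
    Solving this linear relation for u_new gives the update below.
    """
    return (u + dt * lam_slow * u) / (1.0 - dt * lam_fast)


def integrate(u0, dt, n_steps, lam_fast, lam_slow):
    """Apply n_steps IMEX Euler steps of size dt starting from u0."""
    u = u0
    for _ in range(n_steps):
        u = imex_euler_step(u, dt, lam_fast, lam_slow)
    return u


if __name__ == "__main__":
    # Illustrative values (assumptions): a fast and a slow oscillatory mode,
    # with a step size far too large for a fully explicit Euler step.
    lam_fast, lam_slow = 100j, 1j
    dt, n_steps = 0.1, 100
    u_imex = integrate(1.0, dt, n_steps, lam_fast, lam_slow)
    growth_explicit = abs(1.0 + dt * (lam_fast + lam_slow))
    print("|u| after IMEX Euler:", abs(u_imex))
    print("per-step growth of explicit Euler:", growth_explicit)
```

In this toy setting the implicit treatment of the fast mode keeps the solution bounded at a step size where a fully explicit Euler step amplifies it, which is the motivation for the splitting; the accuracy and the parallelism discussed in the paper come from the SDC iteration built on top of such a splitting.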

Paper Structure (16 sections, 9 equations, 4 figures, 1 table)

Introduction
Related work
Models
ICON-O
SWEET
Parallel SDC
SDC in ICON-O
SDC in SWEET
Performance Model
Results
Strong scaling of parallel SDC
Work precision
Speedup
Discussion and outlook
Acknowledgments
...and 1 more sections

Figures (4)

Figure 1: Vorticity contours for the Galewsky test case at the end of day 6.
Figure 2: Strong scaling tests for ICON-O (left) and SWEET (right). Top: using a problem size $N_\text{dofs}=163842$ with ICON-O and $N_\text{dofs}=512^2=262144$ with SWEET. Bottom: using a problem size $N_\text{dofs}=40962$ with ICON-O and $N_\text{dofs}=256^2=65536$ with SWEET. Ideal speedup of time-parallel pSDC determined using the performance model is drawn with the dashed black horizontal lines.
Figure 3: Work-precision plot for ICON-O (left) and SWEET (right). Left: using $48$ threads, i.e., $48$ threads in space for AB and space-only pSDC, and $4$ threads in time and $12$ in space for space-time pSDC. Various time-step sizes are used, for AB ${\vartriangle}t \in \{120, 60, 30, 15\}$ (s) and for pSDC ${\vartriangle}t \in \{1500, 1200, 960, 600, 300\}$ (s). AB is shown with CG tolerances of $10^{-13}$ and $10^{-7}$, pSDC uses $10^{-7}$. Right: using $64$ threads for IMEX and pSDC (space), and nested parallelization with $32$ threads in space and $4$ threads in time for pSDC (space-time). Various time-step sizes are used, for IMEX ${\vartriangle}t \in \{60, \dots, 3.75\}$ (s) and for pSDC ${\vartriangle}t \in \{450, \dots, 60\}$ (s).
Figure 4: Speedup of pSDC compared to reference time-integration schemes for a similar accuracy. Left: for ICON-O, comparing AB (${\vartriangle}t=30$s) and pSDC (${\vartriangle}t=960$s) at similar relative accuracy of $\approx 0.013$, using AB with 2 OpenMP threads (base configuration in ICON-O) as base method for speed-up. Right: for SWEET, comparing IMEX (${\vartriangle}t=15$s) and pSDC (${\vartriangle}t=180$s) at similar absolute accuracy of $\approx 1.5\cdot10^{-5}$.
