Local Adjoints for Simultaneous Preaccumulations with Shared Inputs

Johannes Blühdorn, Nicolas R. Gauger

TL;DR

This work proposes different vector- and map-based approaches for storing local adjoint variables and analyzes them with respect to memory consumption, memory allocation, and adjoint variable access times in the context of simultaneous preaccumulations in multiple threads.

Abstract

In shared-memory parallel automatic differentiation, inputs that are shared among simultaneous thread-local preaccumulations lead to data races if Jacobians are accumulated with a single, shared vector of adjoint variables. In this work, we discuss the benefits and tradeoffs of re-enabling such preaccumulations by a transition to suitable local adjoints. We propose different vector- and map-based approaches for storing local adjoint variables and analyze them with respect to memory consumption, memory allocation, and adjoint variable access times in the context of simultaneous preaccumulations in multiple threads. We implement the approaches in CoDiPack and benchmark them in parallel discrete adjoint computations in the multiphysics simulation suite SU2.

Paper Structure (10 sections, 2 equations, 5 figures)

Figures (5)

  • Figure 1: Computational graph in terms of statements with $n=4$ and $m=2$. Nodes are annotated with identifiers (orange) and edges are annotated with partials (green). In the context of preaccumulation, gray edges indicate connections to other parts of the graph.
  • Figure 2: Preaccumulation involves resizing of the shared vector of adjoint/tangent variables, which requires mutual exclusion with tape evaluations in concurrent preaccumulations (a). Simultaneous preaccumulations with shared inputs lead to data races on the shared vector of adjoints/tangents (b). We propose using thread-local memory to mitigate these data races (c).
  • Figure 3: Implementation strategies for local adjoints. Thread-local duplicates of the global adjoint vector (a), addressing into these vectors with an offset (b), map-based local adjoints (c), remapped identifiers with small, dense thread-local vectors (d), tape editing to accelerate multiple such evaluations (e).
  • Figure 4: Single-socket NACA 0012 and Onera M6 tests with different strategies for preaccumulation. On the left, we assess recording and evaluation times for serial and 12-fold OpenMP parallel execution, including speedups relative to serial execution and error bars to indicate variation across runs. On the right, we assess the memory consumption with varying degrees of parallelism as well as joint memory and runtime performance characteristics, and also include measurements of MPI-parallel runs of the classical, MPI-only build of SU2.
  • Figure 5: Multi-node HL-CRM tests with different strategies for preaccumulation and various degrees of parallelism. On the left, we assess recording and evaluation times. Speedups are relative to the performance with 192 cores (eight nodes), and error bars indicate variation across runs. On the right, we assess the memory consumption as well as joint memory and runtime performance characteristics, including measurements of MPI-parallel runs of the classical, MPI-only version of SU2.
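
To make the strategies in Figure 3 more concrete, the following is a minimal sketch of two of them: map-based local adjoints (c) and remapped identifiers with a small, dense local vector (d). It is not CoDiPack's API. The types `Statement`, `MapAdjoints`, `DenseAdjoints`, and `RemappedTape` and the functions `accumulateJacobian` and `remapIdentifiers` are hypothetical names chosen for illustration, and the tape is reduced to a list of statements with identifiers and partials as in Figure 1. Each thread would own its adjoint storage, so inputs shared between simultaneous preaccumulations are never written through a shared vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using Identifier = std::size_t;
struct Argument { Identifier id; double partial; };
struct Statement { Identifier lhs; std::vector<Argument> args; };
using Jacobian = std::vector<std::vector<double>>;

// Strategy (c): map-based local adjoints. Memory scales with the number of
// identifiers touched, at the price of hashing on every access.
struct MapAdjoints {
  std::unordered_map<Identifier, double> data;
  double& operator[](Identifier id) { return data[id]; }  // new entries start at 0.0
  void reset() { data.clear(); }
};

// Strategy (d): small dense vector, valid only for remapped identifiers.
struct DenseAdjoints {
  std::vector<double> data;
  explicit DenseAdjoints(std::size_t size) : data(size, 0.0) {}
  double& operator[](Identifier id) { assert(id < data.size()); return data[id]; }
  void reset() { std::fill(data.begin(), data.end(), 0.0); }
};

// One reverse tape evaluation per output; the adjoint storage is local to the caller.
template <typename Adjoints>
Jacobian accumulateJacobian(const std::vector<Statement>& tape,
                            const std::vector<Identifier>& inputs,
                            const std::vector<Identifier>& outputs,
                            Adjoints& adjoints) {
  Jacobian jacobian(outputs.size(), std::vector<double>(inputs.size(), 0.0));
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    adjoints.reset();
    adjoints[outputs[i]] = 1.0;  // seed the i-th output
    for (auto stmt = tape.rbegin(); stmt != tape.rend(); ++stmt) {
      const double lhsAdjoint = adjoints[stmt->lhs];
      adjoints[stmt->lhs] = 0.0;
      for (const Argument& arg : stmt->args) {
        adjoints[arg.id] += arg.partial * lhsAdjoint;
      }
    }
    for (std::size_t j = 0; j < inputs.size(); ++j) {
      jacobian[i][j] = adjoints[inputs[j]];
    }
  }
  return jacobian;
}

struct RemappedTape {
  std::vector<Statement> tape;
  std::vector<Identifier> inputs, outputs;
  std::size_t count = 0;  // number of distinct identifiers after remapping
};

// Replace the original identifiers by contiguous ones starting at zero.
RemappedTape remapIdentifiers(const std::vector<Statement>& tape,
                              const std::vector<Identifier>& inputs,
                              const std::vector<Identifier>& outputs) {
  std::unordered_map<Identifier, Identifier> newId;
  auto local = [&newId](Identifier id) {
    return newId.emplace(id, newId.size()).first->second;
  };
  RemappedTape result;
  for (Identifier id : inputs) result.inputs.push_back(local(id));
  for (const Statement& stmt : tape) {
    Statement remapped{local(stmt.lhs), {}};
    for (const Argument& arg : stmt.args) {
      remapped.args.push_back({local(arg.id), arg.partial});
    }
    result.tape.push_back(remapped);
  }
  for (Identifier id : outputs) result.outputs.push_back(local(id));
  result.count = newId.size();
  return result;
}
```

In this sketch, a thread would either call `accumulateJacobian` on the original tape with its own `MapAdjoints`, or first call `remapIdentifiers` and then evaluate the remapped tape with a `DenseAdjoints` of size `count`. The first variant avoids the remapping pass; the second pays for it once and then gets plain indexed access on every evaluation. That tradeoff between memory consumption, memory allocation, and adjoint access time is the one the abstract says the paper analyzes.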