Profiling checkpointing schedules in adjoint ST-AD

Laurent Hascoët, Jean-Luc Bouchot, Shreyas Sunil Gaikwad, Sri Hari Krishna Narayanan, Jan Hückelheim

TL;DR

This work tackles the problem of selecting checkpoint locations on the call tree of adjoint ST-AD to balance run time and memory under a fixed storage budget. It introduces a profiling-based greedy approach, implemented inside the Tapenade AD tool, that estimates the cost and benefit of each checkpoint (run-time gain and memory cost) and guides the iterative activation or inhibition of static checkpoints. On MITgcm-derived test cases (notably streamice and tutorial_global_oce_biogeo), the method achieves substantial speedups (up to 2–3× in some configurations) with modest memory overhead and demonstrates beneficial interactions with binomial checkpointing on time-stepping loops. The study also discusses limitations, such as potential inaccuracies from AD optimizations and the need for re-profiling when configurations change, charting a path toward more automated and integrated checkpointing strategies.

Abstract

Checkpointing is a cornerstone of data-flow reversal in adjoint algorithmic differentiation. It is a storage/recomputation trade-off that can be applied at different levels, one of which is the call tree. We look for good placements of checkpoints on the call tree of a given application, to reduce the run time and memory footprint of its adjoint. There is no known optimal solution to this problem other than a combinatorial search over all placements. We propose a heuristic based on run-time profiling of the adjoint code. We describe the implementation of this profiling tool in an existing source-transformation AD tool. We demonstrate the benefit of this approach on test cases taken from the MITgcm ocean and atmospheric global circulation model. We discuss the limitations of our approach and propose directions to lift them.
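To make the idea of a profiling-guided heuristic concrete, here is a minimal Python sketch of a greedy selection under a memory budget. It is not the algorithm implemented in the AD tool: the `Candidate` record, the `greedy_select` function, the strategy names, and all numbers are hypothetical, and it assumes that the memory costs of individual decisions simply add up, which need not hold for the peak stack size of a real adjoint. It also performs a single pass over fixed estimates, whereas the approach summarized above re-profiles between decisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical profiling record for one checkpointed call site."""
    name: str         # illustrative label of the call site
    time_gain: float  # estimated run time saved by inhibiting the checkpoint
    mem_cost: float   # estimated extra stack memory this would require


def greedy_select(candidates, budget, strategy="gain_first"):
    """Pick checkpoints to inhibit, greedily, without exceeding `budget`.

    "gain_first" ranks by estimated time gain alone;
    "careful_on_memory" ranks by time gain per unit of memory.
    Simplification: memory costs are treated as additive.
    """
    if strategy == "gain_first":
        def rank(c):
            return c.time_gain
    elif strategy == "careful_on_memory":
        def rank(c):
            return c.time_gain / c.mem_cost if c.mem_cost > 0 else float("inf")
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    chosen, used = [], 0.0
    for c in sorted(candidates, key=rank, reverse=True):
        if c.time_gain <= 0:
            continue  # inhibiting this checkpoint would not save time
        if used + c.mem_cost <= budget:
            chosen.append(c.name)
            used += c.mem_cost
    return chosen, used


# Made-up profile: four call sites with estimated gains and costs.
profile = [
    Candidate("site_a", time_gain=10.0, mem_cost=8.0),
    Candidate("site_b", time_gain=6.0, mem_cost=2.0),
    Candidate("site_c", time_gain=5.0, mem_cost=3.0),
    Candidate("site_d", time_gain=-1.0, mem_cost=1.0),
]
print(greedy_select(profile, budget=8.0, strategy="gain_first"))
print(greedy_select(profile, budget=8.0, strategy="careful_on_memory"))
```

With this made-up profile the two rankings choose different sets under the same budget, which illustrates why the order in which suggestions are followed matters.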

Paper Structure (16 sections, 9 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: Sketch of execution for the adjoint of a code arbitrarily split into a sequence of three parts $U$, $C$, and $D$. Thick arrows to the right stand for forward sweeps, which run the primal code while storing intermediate values on the stack. Thick arrows to the left stand for backward sweeps, which retrieve values from the stack and use them while propagating gradients backwards. Time goes top-down, and within each line time follows the direction of the arrows.
  • Figure 2: Sketch of execution for the adjoint of the same code as in Fig. 1, this time checkpointing code part $C$. Time goes top-down, and within each line time follows the direction of the arrows. The thin arrow to the right stands for the extra execution of the primal $C$, and the black and white bullets stand respectively for the writing and reading of the snapshot, i.e. a set of variables sufficient for an identical duplicate execution of $C$.
  • Figure 3: Adjoint code with profiling callbacks inserted, corresponding to the code sketch of Fig. 2. Forward sweep on top, turn point in the middle, backward sweep below. Except for the turn point, callbacks take as arguments the name of the called function and the corresponding line number in the primal code, for later reference.
  • Figure 4: Tuning the adjoint of streamice by profiling. Each checkpointing configuration is shown as a dot whose coordinates are the peak stack size and run time of the adjoint. The red and blue dotted lines show the evolution of the adjoint's performance when repeatedly following the suggestions made by our profiling tool. The red line follows a "run-time gain first" strategy, the blue line a "careful on memory" strategy. Time measurements are averaged over 5 runs. Smaller gray dots show the performance of 250 random checkpointing configurations. Green dots show the combination with binomial checkpointing, discussed in the section on binomial checkpointing.
  • Figure 5: Tuning the adjoint of tutorial_global_oce_biogeo by profiling. Each checkpointing configuration is shown as a dot whose coordinates are peak stack size and run time of the adjoint. The red and blue lines show the evolution of performance guided by profiling, respectively with a "run-time gain first" strategy, and a "careful on memory" strategy. Gray dots show the performance of 250 random checkpointing configurations.
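
The execution schemes described in the captions of Figures 1 and 2 can be illustrated with a small Python sketch on scalar functions. This is a toy model, not code generated by the AD tool: the helper names, the choice of elementary steps, and the decision to keep the snapshot on the same stack as the intermediate values are all assumptions made for illustration. Each part $U$, $C$, $D$ is a list of (function, derivative) pairs, and the stack depth stands in for the memory footprint.

```python
import math

# Elementary steps as (function, derivative) pairs; illustrative choices.
SIN = (math.sin, math.cos)
SQR = (lambda x: x * x, lambda x: 2.0 * x)


def primal(steps, x):
    """Run the primal code only, storing nothing."""
    for f, _ in steps:
        x = f(x)
    return x


def forward_sweep(steps, x, stack):
    """Run the primal code and push each step's input on the stack."""
    for f, _ in steps:
        stack.append(x)
        x = f(x)
    return x


def backward_sweep(steps, xbar, stack):
    """Pop the stored inputs and propagate the gradient backwards."""
    for _, df in reversed(steps):
        xbar *= df(stack.pop())
    return xbar


def adjoint_store_all(U, C, D, x):
    """Scheme of Figure 1: one forward sweep, then one backward sweep."""
    stack = []
    forward_sweep(U + C + D, x, stack)
    peak = len(stack)
    return backward_sweep(U + C + D, 1.0, stack), peak


def adjoint_checkpoint_C(U, C, D, x):
    """Scheme of Figure 2: part C is checkpointed."""
    stack = []
    x = forward_sweep(U, x, stack)
    stack.append(x)                 # write the snapshot (here: the input of C)
    x = primal(C, x)                # extra primal run of C, nothing stored
    forward_sweep(D, x, stack)
    peak = len(stack)
    xbar = backward_sweep(D, 1.0, stack)
    snapshot = stack.pop()          # read the snapshot
    forward_sweep(C, snapshot, stack)
    peak = max(peak, len(stack))
    xbar = backward_sweep(C, xbar, stack)
    return backward_sweep(U, xbar, stack), peak


U = [SIN]
C = [SQR, SIN, SQR, SIN]
D = [SQR, SIN, SQR]
grad_all, peak_all = adjoint_store_all(U, C, D, 0.5)
grad_ckp, peak_ckp = adjoint_checkpoint_C(U, C, D, 0.5)
print(grad_all, peak_all)
print(grad_ckp, peak_ckp)
```

Both schemes return the same gradient. In this toy setting the checkpointed variant reaches a lower peak stack depth at the price of running the primal $C$ one extra time, which is the storage/recomputation trade-off that the placement of checkpoints has to balance.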