Scrutinizing Variables for Checkpoint Using Automatic Differentiation

Xin Huang, Weiping Zhang, Shiman Meng, Wubiao Xu, Xiang Fu, Luanzheng Guo, Kento Sato

TL;DR

This work proposes a systematic approach that uses automatic differentiation (AD) to scrutinize every element of the variables that must be checkpointed, identify which elements are critical and which are uncritical, and exclude the uncritical elements from the checkpoint.

Abstract

Checkpoint/Restart (C/R) periodically saves the running state of a program, which consumes considerable system resources. We observe that in typical HPC applications not every piece of data is involved in the computation; such unused data should be excluded from checkpointing for better storage and compute efficiency. To identify such data, we propose a systematic approach that leverages automatic differentiation (AD) to scrutinize every element within the variables (e.g., arrays) to be checkpointed, allowing us to identify critical and uncritical elements and to eliminate the uncritical ones from checkpointing. Specifically, we inspect every element of a checkpointed variable with an AD tool to determine whether that element has an impact on the application output. We empirically validate our approach with eight benchmarks from the NAS Parallel Benchmark (NPB) suite. We visualize the critical and uncritical elements and regions within a variable according to whether they affect the application output (yes or no). We find that the patterns and distributions of critical and uncritical elements are quite interesting and follow the physical formulation and logic of the algorithm. The evaluation on NPB benchmarks shows that our approach reduces checkpoint storage by up to 20%.
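
The element-wise inspection described in the abstract can be illustrated with a minimal sketch. It is not the authors' tool or any NPB code: `Var`, `backward`, `toy_norm`, `critical_mask`, and `compact_checkpoint` are hypothetical names, and the tiny reverse-mode AD below stands in for a real AD tool. The sketch assumes that an element is uncritical when the derivative of the output with respect to it is zero. Because a zero derivative at a single input point does not prove independence in general, the sample values are chosen to be nonzero.

```python
class Var:
    """Scalar node in a computation graph (hypothetical toy AD, not the paper's tool)."""

    def __init__(self, value, parents=()):
        self.value = float(value)
        self.parents = parents  # pairs of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__


def backward(output):
    """Reverse-mode sweep: accumulate d(output)/d(node) into node.grad."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before the node itself

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # every consumer is processed before its inputs
        for parent, local in node.parents:
            parent.grad += local * node.grad


def toy_norm(u):
    """Hypothetical kernel: only interior elements feed the output."""
    total = 0.0
    for i in range(1, len(u) - 1):
        total = total + u[i] * u[i]
    return total


def critical_mask(func, values):
    """Mark an element critical when the output's derivative w.r.t. it is nonzero."""
    nodes = [Var(v) for v in values]
    backward(func(nodes))
    return [node.grad != 0.0 for node in nodes]


def compact_checkpoint(values, mask):
    """Keep only (index, value) pairs of critical elements (assumed storage layout)."""
    return [(i, v) for i, (v, keep) in enumerate(zip(values, mask)) if keep]


u = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
mask = critical_mask(toy_norm, u)
checkpoint = compact_checkpoint(u, mask)
```

In this toy setting the two boundary elements never reach the output, so they are marked uncritical and dropped from `checkpoint`. The mask could also be computed once and reused across checkpoints if the access pattern does not change between iterations, which is an assumption of this sketch rather than a claim from the paper.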

Paper Structure (15 sections, 1 equation, 8 figures, 3 tables)

Figures (8)

  • Figure 1: An example of AD workflow. $a$ is a constant.
  • Figure 2: Source code of the function $error\_norm$ in $BT$
  • Figure 3: A typical critical-uncritical distribution in NPB benchmarks (red: critical, blue: uncritical). Variables following this distribution: $u(BT)$, $u(SP)$, $u[x][y][z][0](LU)$, $u[x][y][z][1](LU)$, $u[x][y][z][2](LU)$, $u[x][y][z][3](LU)$, $rho\_i(LU)$, $qs(LU)$, $rsd(LU)$
  • Figure 4: Critical-uncritical distribution of array $u$ in MG
  • Figure 5: Critical-uncritical distribution of array $r$ in MG
  • ...and 3 more figures