AutoCheck: Automatically Identifying Variables for Checkpointing by Data Dependency Analysis

Xiang Fu, Weiping Zhang, Xin Huang, Wubiao Xu, Shiman Meng, Luanzheng Guo, Kento Sato

TL;DR

AutoCheck is an analytical model and tool that automatically identifies critical variables to checkpoint for C/R, using a set of heuristics applied to a refined data dependency graph (DDG).

Abstract

Checkpoint/Restart (C/R) has been widely deployed in numerous HPC systems, clouds, and industrial data centers, which are typically operated by system engineers. Nevertheless, no existing approach helps system engineers without domain expertise, or domain scientists without system fault tolerance knowledge, identify the critical variables that C/R must save to restore application execution correctly after a failure. To address this problem, we propose an analytical model and a tool (AutoCheck) that automatically identify critical variables to checkpoint for C/R. AutoCheck relies on, first, analytically tracking and optimizing data dependencies between variables and other application execution state, and second, a set of heuristics that identify critical variables for checkpointing from the refined data dependency graph (DDG). AutoCheck allows programmers to pinpoint critical variables to checkpoint within a few minutes. We evaluate AutoCheck on 14 representative HPC benchmarks, demonstrating that AutoCheck can efficiently identify the correct critical variables to checkpoint.

Paper Structure (24 sections, 7 figures, 4 tables, 2 algorithms)

Figures (7)

  • Figure 1: An example of a dynamic instruction execution trace, including two instruction blocks.
  • Figure 2: AutoCheck workflow diagram.
  • Figure 3: Pre-processing workflow.
  • Figure 4: Example code.
  • Figure 5: Data dependency analysis (R/W = Read/Write). Note that the reg-var map in (a) is updated on the fly while passing through the dynamic instructions. Thus, the reg-var map contains only the state that is active at a given point.
  • ...and 2 more figures