Checkpoint and Restart: An Energy Consumption Characterization in Clusters

Marina Moran, Javier Balladini, Dolores Rexachs, Emilio Luque

TL;DR

This study characterizes the energy implications of coordinated checkpoint and restart (C/R) in homogeneous HPC clusters across hardware, software, and system configurations. Using two platforms, DMTCP with optional compression, and NFS configurations, the authors quantify how P and C processor states, problem size, and I/O settings influence power, time, and total energy during C/R. Key findings include energy savings when C states are enabled, significant timing benefits from asynchronous NFS at high frequencies, and opposing effects of compression on checkpoint versus restart energy. The results provide practical guidelines to reduce C/R energy overhead in HPC environments and point to future work on alternative compression methods and fault-tolerance tools.

Abstract

The fault tolerance method currently used in High Performance Computing (HPC) is rollback-recovery based on checkpoints. Like any other fault tolerance method, it adds energy consumption on top of that of the application's execution. The objective of this work is to determine the factors that affect the energy consumption of the compute nodes of a homogeneous cluster when performing checkpoint and restart operations on SPMD (Single Program Multiple Data) applications. We focus on the energy behavior of compute nodes under different configurations of hardware and software parameters. We study the effect of processor performance states (P states) and power states (C states), application problem size, checkpoint software (DMTCP), and distributed file system (NFS) configuration. Analysis of the results allowed us to identify opportunities to reduce the energy consumption of checkpoint and restart operations.
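
The quantities compared in this study (power, time, and total energy) are linked by the fact that energy is the time integral of power. The following is a minimal sketch of that relationship, not the authors' measurement procedure: the function name `energy_joules` and the sample values are hypothetical, and the trapezoidal rule is assumed as the integration method.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Area of the trapezoid between two consecutive power samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical power samples taken during one checkpoint (illustrative values only).
times = [0.0, 1.0, 2.0, 3.0]
power = [100.0, 120.0, 120.0, 100.0]
checkpoint_energy = energy_joules(times, power)
print(f"Energy: {checkpoint_energy:.1f} J")
```

Because energy depends on both power and duration, a setting that lowers power draw can still raise total energy if it lengthens the operation enough.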

Paper Structure (17 sections, 12 figures)

Figures (12)

  • Figure 1: Power Dissipation during Checkpoint and Restart.
  • Figure 2: Power Dissipation, Network Bandwidth and CPU utilization during Checkpoint and Restart.
  • Figure 3: Influence of P states on Platform 1.
  • Figure 4: Influence of P states on Platform 2.
  • Figure 5: Influence of C states on power dissipation - Platform 1.
  • ...and 7 more figures