LLload: An Easy-to-Use HPC Utilization Tool

Chansup Byun, Albert Reuther, Julie Mullen, LaToya Anderson, William Arcand, Bill Bergeron, David Bestor, Alexander Bonn, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Piotr Luszczek, Peter Michaleas, Lauren Milechin, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner

TL;DR

The paper addresses the challenge of understanding and improving HPC resource utilization across diverse user workloads and complex systems. It introduces LLload, a lightweight per-user monitoring tool with a real-time CLI that collects a compact set of metrics (CPU, GPU, memory) and supports options such as --all, -t N, and -n to help engineers and users diagnose inefficient submissions and overloaded nodes. A weekly analytics pipeline archives 15-minute LLload snapshots as TSV files and analyzes low-load and high-load patterns in terms of node-hours using a MATLAB/Octave workflow. The results drive targeted user feedback and guidance (e.g., GPU overloading strategies) to improve utilization. The approach shows potential gains in GPU utilization and throughput in some scenarios and gives users and systems engineers actionable information for optimizing HPC resource usage. Key contributions are the per-user, real-time monitoring capability, the integration of weekly analytics for behavior-driven optimization, and the demonstration that correctable submission issues can significantly boost resource utilization. The work is relevant to HPC centers seeking to improve utilization with minimal tooling overhead.
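The node-hours accounting described above can be sketched in a few lines. This is a minimal illustration, not the authors' workflow: the paper reports a MATLAB/Octave pipeline, whereas this sketch uses Python, and the TSV column names (`user`, `node`, `cores`, `load`) and the low/high thresholds are assumptions made for the example, not the actual LLload archive format.

```python
import csv
import io
from collections import defaultdict

# Each archived snapshot is taken to represent one 15-minute interval.
SNAPSHOT_HOURS = 0.25


def node_hours_by_load(tsv_text, low=0.25, high=1.0):
    """Sum low-load and high-load node-hours per user.

    Hypothetical input: one TSV row per (snapshot, user, node) with the
    columns user, node, cores, load. The thresholds are illustrative:
    a normalized load (load / cores) below `low` counts as low load,
    and above `high` counts as high load.
    """
    totals = defaultdict(lambda: {"low": 0.0, "high": 0.0, "total": 0.0})
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        ratio = float(row["load"]) / float(row["cores"])
        bucket = totals[row["user"]]
        bucket["total"] += SNAPSHOT_HOURS
        if ratio < low:
            bucket["low"] += SNAPSHOT_HOURS
        elif ratio > high:
            bucket["high"] += SNAPSHOT_HOURS
    # Convert to plain dicts so the result is easy to print or compare.
    return {user: dict(bucket) for user, bucket in totals.items()}
```

Calling `node_hours_by_load` on the text of one week's archived snapshots would give, per user, the node-hours spent below and above the chosen thresholds, which is the kind of summary that could be used to decide whom to contact with guidance.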

Abstract

The increasing use and cost of high performance computing (HPC) require new easy-to-use tools that enable HPC users and HPC systems engineers to transparently understand the utilization of resources. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed a simple command, LLload, to monitor and characterize HPC workloads. LLload plays an important role in identifying opportunities for better utilization of compute resources. LLload can be used to monitor jobs both programmatically and interactively. Its options let users characterize their jobs and identify ways to run them more efficiently. This information can be used to help users optimize HPC workloads and improve both CPU and GPU utilization, including through judicious oversubscription of computing resources. Preliminary results suggest significant improvement in GPU utilization and overall throughput performance with GPU overloading in some cases. By enabling users to observe and fix incorrect job submissions and/or inappropriate execution setups, LLload can increase resource usage and improve overall throughput performance. LLload is a lightweight, easy-to-use tool for both HPC users and HPC systems engineers to monitor HPC workloads and improve system utilization and efficiency.

Paper Structure

This paper contains 8 sections and 11 figures.

Figures (11)

  • Figure 1: A schematic diagram of the LLload analysis pipeline for a supercomputing system at the MIT Lincoln Laboratory Supercomputing Center.
  • Figure 2: The default output of LLload, which conveys CPU utilization and system memory use.
  • Figure 3: Typical output of LLload with the -g GPU option, which adds GPU utilization and GPU memory information along with CPU information.
  • Figure 4: Typical output of LLload with the --all -g option.
  • Figure 5: An output showing the top 5 compute nodes with the highest CPU loads.
  • ...and 6 more figures