Table of Contents
Fetching ...

TX-Digital Twin: Visualizing Supercomputer GPU Performance Data Stream

Elena Baskakova, William Bergeron, Matthew Hubbell, Hayden Jananthan, Jeremy Kepner

Abstract

Supercomputers are complex, dynamic systems that serve thousands of users and are built with thousands of compute nodes. Due to the vast amounts of system and performance data needed to accurately capture their status, supercomputers require complex methods to monitor, maintain, and optimize. Data visualization is a powerful technique for overseeing these large streams of data in an easily interpretable way. The MIT Lincoln Laboratory Supercomputing Center (LLSC) enables effective monitoring through combining 3D gaming technology with compound data streams in the TX-Digital Twin, a 3D simulation of the supercomputer. The TX-Digital Twin offers both live and historical data, in visual and text formats, and tracks a multitude of revealing performance metrics. Recent increasing interest in GPU-accelerated computing has driven a need for monitoring and maintenance of GPU-accelerated resources in supercomputers. In this paper, we build on our previous solution by integrating the visualization of additional GPU metrics, such as GPU memory usage, temperature, and power draw, into the TX-Digital Twin. Using techniques in draw call optimization, we add clear and effective displays of the new metrics while keeping the effects on performance minimal.

TX-Digital Twin: Visualizing Supercomputer GPU Performance Data Stream

Abstract

Supercomputers are complex, dynamic systems that serve thousands of users and are built with thousands of compute nodes. Due to the vast amounts of system and performance data needed to accurately capture their status, supercomputers require complex methods to monitor, maintain, and optimize. Data visualization is a powerful technique for overseeing these large streams of data in an easily interpretable way. The MIT Lincoln Laboratory Supercomputing Center (LLSC) enables effective monitoring through combining 3D gaming technology with compound data streams in the TX-Digital Twin, a 3D simulation of the supercomputer. The TX-Digital Twin offers both live and historical data, in visual and text formats, and tracks a multitude of revealing performance metrics. Recent increasing interest in GPU-accelerated computing has driven a need for monitoring and maintenance of GPU-accelerated resources in supercomputers. In this paper, we build on our previous solution by integrating the visualization of additional GPU metrics, such as GPU memory usage, temperature, and power draw, into the TX-Digital Twin. Using techniques in draw call optimization, we add clear and effective displays of the new metrics while keeping the effects on performance minimal.

Paper Structure

This paper contains 11 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: TX-Digital Twin full system view with 3D Graphics and text
  • Figure 2: In-game stack of GPU-accelerated nodes in v0.6.5 (left) compared to v0.7.2 (middle); simulated stack of nodes showing varying metric levels (right).
  • Figure 3: Total CPU Time Per Frame During Program Runs. Data represents samples from 100 frames from three runs of each version.
  • Figure 4: Median value across five independent measurements of median timing of RenderLoop on the main and render threads.