Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs

Marco Kurzynski, Shaizeen Aga, Di Wu

TL;DR

The paper identifies Lit Silicon, a dynamic coupling between thermally induced straggling and concurrent computation and communication (C3), as a key driver of node-level performance variation in single-node multi-GPU LLM training. It develops analytical models for performance and power, argues that aligning GPU frequencies can mitigate the worst of the variation, and proposes lightweight node-level power-management strategies that detect and mitigate Lit Silicon with minimal overhead. Empirical evaluation on AMD Instinct MI300X systems with Llama and Mistral workloads shows up to ~6% performance and ~4% power improvements, with considerable potential for savings at datacenter scale. The solution is designed to be easily adoptable as an additional node-level power-management layer, complementary to existing GPU- and cluster-level controls, and broadly applicable to both training and inference workloads.

Abstract

GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems suffer from performance variation at the node and cluster levels. Such performance variation significantly impacts both high-performance computing and artificial intelligence workloads, such as cutting-edge large language models (LLMs). We analyze the performance of a single-node multi-GPU system running LLM training, and observe that kernel-level performance variation is highly correlated with concurrent computation and communication (C3), a technique that overlaps computation and communication across GPUs for performance gains. We then reason that thermally induced straggling, coupled with C3, drives this performance variation, an interaction we call the Lit Silicon effect. In a multi-GPU node, thermal imbalance across GPUs creates node-level straggler GPUs, which in turn slow down the leader GPUs. Lit Silicon leads to node-level performance variation and inefficiency, impacting the entire datacenter from the bottom up. We propose analytical performance and power models for Lit Silicon to understand the potential system-level gains. We further design simple detection and mitigation techniques to address the Lit Silicon problem, and evaluate three power management solutions: power optimization under GPU thermal design power, performance optimization under node-level GPU power capping, and performance optimization under node-level CPU power sloshing. We conduct experiments with two workloads on two AMD Instinct™ MI300X GPU systems under two LLM training frameworks, and observe up to 6% performance and 4% power improvements, potentially saving hundreds of millions of dollars in datacenters. Our solution is nearly a free lunch and can be readily adopted in datacenters as a new node-level power management layer.
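
The straggler intuition in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's analytical model. It assumes that a kernel's run time scales inversely with clock frequency and that each collective acts as a barrier, so every GPU waits for the slowest one. The function names, the cycle count, and the frequencies are all hypothetical.

```python
# Toy barrier model (an assumption for illustration, not the paper's model).

def step_time(freqs_mhz, work_cycles=2.1e9):
    """Return (synchronized step time, per-GPU compute times) in seconds.

    Assumes every GPU executes the same number of cycles and that the step
    ends only when the slowest GPU finishes.
    """
    per_gpu = [work_cycles / (f * 1e6) for f in freqs_mhz]
    return max(per_gpu), per_gpu


def idle_fractions(freqs_mhz):
    """Fraction of each step that every GPU spends waiting at the barrier."""
    t_step, per_gpu = step_time(freqs_mhz)
    return [1.0 - t / t_step for t in per_gpu]


# Hypothetical node: three cool GPUs and one hot GPU throttled to 1900 MHz.
imbalanced = [2100, 2100, 2100, 1900]
# Aligning every GPU to the straggler's frequency leaves the step time
# unchanged in this toy model, which hints at where power could be saved.
aligned = [1900, 1900, 1900, 1900]

print(step_time(imbalanced)[0], step_time(aligned)[0])
print(idle_fractions(imbalanced))
```

In this simplified picture the leader GPUs idle for roughly 9.5% of each step, so lowering their frequency to match the straggler costs no step time. The real effect described in the paper involves the interaction with overlapped communication, which this sketch does not capture.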

Paper Structure

This paper contains 29 sections, 15 equations, 15 figures, 3 tables, 3 algorithms.

Figures (15)

  • Figure 1: Overview of this paper. We start from the performance variation in a multi-GPU training, identify the Lit Silicon effect as a major contributor, and propose solutions to address this effect.
  • Figure 2: Concurrent computation and communication in FSDP. vec: vector operations. f_/b_: forward/backward. qkv_ip: input projection GEMM of Q/K/V tensors. attn: attention. fa: flash attention. op: output projection GEMM. mlp: multi-layer perceptron. gp/dp/up: gate/down/up projection GEMM.
  • Figure 3: Comparison between the overlap ratio and the kernel duration for Llama 3.1 8B training over three training iterations. Each line represents a unique GPU across time (x axis), and each sample in a line is for a unique layer or kernel. The red line marks the straggler GPU, and the gray lines denote the leader GPUs. Default settings from Table \ref{tab:sens_study} are used.
  • Figure 4: Correlation between overlap ratio and kernel duration of kernels across GPUs (numbered). f_/b_: forward/backward. qkv_ip: input projection GEMM of Q/K/V tensors. attn: attention. fa: flash attention. op: output projection GEMM. n: normalization. mlp: multi-layer perceptron. gp: gate projection GEMM. dp: down projection GEMM. up: up projection GEMM. Default settings from Table \ref{tab:sens_study} are used.
  • Figure 5: Temperature and frequency over three training iterations. Both the temperature and frequency are normalized to the lowest value. Default settings from Table \ref{tab:sens_study} are used.
  • ...and 10 more figures
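
The captions for Figures 3 and 4 refer to an overlap ratio and its correlation with kernel duration. The paper's exact definition is not given in this summary, so the sketch below assumes one plausible reading: the fraction of a compute kernel's duration during which a communication kernel runs concurrently on the same GPU. The interval format, the function names, and the sample values are hypothetical.

```python
from math import sqrt


def overlap_ratio(compute, comms):
    """Fraction of a compute interval covered by communication intervals.

    compute: (start, end) of one compute kernel.
    comms:   list of (start, end) communication kernels on the same GPU,
             assumed not to overlap one another.
    """
    c_start, c_end = compute
    covered = sum(
        max(0.0, min(c_end, end) - max(c_start, start)) for start, end in comms
    )
    return covered / (c_end - c_start)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical trace: one compute kernel overlapped by two comm kernels.
print(overlap_ratio((0.0, 10.0), [(-5.0, 2.0), (8.0, 12.0)]))

# Hypothetical per-GPU samples for one kernel type (ratio, duration in ms).
ratios = [0.10, 0.35, 0.40, 0.80]
durations_ms = [1.00, 1.08, 1.11, 1.25]
print(pearson(ratios, durations_ms))
```

A correlation near 1 for a given kernel type would be consistent with the captions' claim that kernels that overlap more with communication also run longer, although the sample values here are made up for illustration.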