Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs

Marco Kurzynski; Shaizeen Aga; Di Wu

TL;DR

The paper identifies Lit Silicon, a dynamic coupling between thermally induced straggling and concurrent computation and communication (C3), as a key driver of node-level performance variation in single-node multi-GPU LLM training. It develops analytical performance and power models, which show that aligning GPU frequencies can mitigate the worst of the variation, and proposes lightweight, node-level power-management strategies to detect and mitigate Lit Silicon with minimal overhead. Empirical evaluation on AMD Instinct MI300X systems with Llama and Mistral workloads shows up to 6% performance and 4% power improvements, with potentially large cost savings at datacenter scale. The solution is designed to be easily adoptable as an additional node-level power-management layer, complementary to existing GPU- and cluster-level controls, and broadly applicable to both training and inference workloads.

Abstract

GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems suffer from performance variation at the node and cluster levels. Such variation significantly impacts both high-performance computing and artificial intelligence workloads, such as cutting-edge large language models (LLMs). We analyze the performance of a single-node multi-GPU system running LLM training, and observe that kernel-level performance variation is highly correlated with concurrent computation and communication (C3), a technique that overlaps computation and communication across GPUs for performance gains. We then go a step further and reason that thermally induced straggling, coupled with C3, drives this performance variation, which we term the Lit Silicon effect. In a multi-GPU node, thermal imbalance across GPUs introduces node-level straggler GPUs, which in turn slow down the leader GPUs. Lit Silicon leads to node-level performance variation and inefficiency, impacting the entire datacenter from the bottom up. We propose analytical performance and power models for Lit Silicon to understand the potential system-level gains. We further design simple detection and mitigation techniques to address the Lit Silicon problem, and evaluate three power management solutions: power optimization under GPU thermal design power, performance optimization under node-level GPU power capping, and performance optimization under node-level CPU power sloshing. We conduct experiments with two workloads on two AMD Instinct™ MI300X GPU systems under two LLM training frameworks, and observe up to 6% performance and 4% power improvements, potentially saving hundreds of millions of dollars in datacenters. Our solution is almost a free lunch and can be adopted in datacenters with little effort as a new node-level power management layer.
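
To make the coupling concrete, here is a minimal toy sketch in Python. It is not the paper's analytical model. It assumes (hypothetically) that every GPU synchronizes once per training step, that per-GPU compute time scales inversely with clock frequency, and that GPU power scales with the cube of frequency. The function names, the constants `work`, `k`, and `exponent`, and the frequency values are all illustrative assumptions. Under these assumptions, a single thermally throttled GPU sets the step time for the whole node, so aligning the leaders' frequencies to the straggler lowers node power without lengthening the step.

```python
def step_time(freqs_mhz, work=1.0e6):
    """Synchronized step time: the slowest GPU gates the whole node.

    Assumption: per-GPU compute time is work / frequency.
    """
    return max(work / f for f in freqs_mhz)


def node_power(freqs_mhz, k=1.0e-7, exponent=3.0):
    """Total GPU power under an assumed power ~ k * f**exponent model."""
    return sum(k * f ** exponent for f in freqs_mhz)


def find_stragglers(freqs_mhz, tolerance=0.05):
    """Indices of GPUs running more than `tolerance` below the fastest GPU."""
    threshold = (1.0 - tolerance) * max(freqs_mhz)
    return [i for i, f in enumerate(freqs_mhz) if f < threshold]


def align_to_straggler(freqs_mhz):
    """Lower every GPU to the slowest GPU's frequency."""
    slowest = min(freqs_mhz)
    return [slowest] * len(freqs_mhz)


# Illustrative 8-GPU node: seven leaders and one thermally throttled GPU.
freqs = [2100, 2100, 2100, 2100, 2100, 2100, 2100, 1800]
aligned = align_to_straggler(freqs)

print("stragglers:", find_stragglers(freqs))
print("step time before/after:", step_time(freqs), step_time(aligned))
print("node power before/after:", node_power(freqs), node_power(aligned))
```

In this toy model the step time is unchanged after alignment because the straggler already determined it, while the leaders stop spending power on speed they cannot use. The freed power could instead be shifted toward the straggler under a node-level cap, which is the intuition behind the performance-oriented strategies the abstract lists.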
