Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation

Ioannis Zarkadas, Amanda Tomlinson, Asaf Cidon, Baris Kasikci, Ofir Weisse

TL;DR

xPU-Shark introduces a record-and-replay framework that repurposes a Golden Reference Model (GRM) as an ISA-level simulator for fine-grained microarchitectural analysis of ML accelerators. By capturing production traces with a step debugger and replaying them in a software-only GRM-based simulator, it yields actionable insights into DMAs, VMEM utilization, and instruction dependencies that traditional profilers miss. The approach identifies previously unknown inefficiencies in production LLMs and enables optimizations such as an improvement of up to 15% in All-Gather and a reduction of up to 4.1% in token-generation latency, with broader implications for VMEM planning and DMA scheduling. Because the workflow is software-only and requires no recompilation, it is practical for hyperscalers to deploy across fleets, potentially delivering significant cost and power savings in large-scale model serving.

Abstract

As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse-grained and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization suggestions. Our core insight is to use a hardware-level simulator, an artifact of the hardware design process that we can repurpose for performance analysis. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator to gain low-level insights into the model's performance. We implemented xPU-Shark for our in-house accelerator and used it to analyze the performance of several of our production LLMs, revealing several previously unknown microarchitecture inefficiencies. Leveraging these insights, we optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.

Paper Structure

This paper contains 31 sections, 10 figures, 1 table, 2 algorithms.

Figures (10)

  • Figure 1: Overview of xPU-Shark.
  • Figure 2: A toy example showing a coarse-grained vs fine-grained analysis. Many existing tools provide coarse-grained information at the kernel or HLO level, while deep optimization requires a fine-grained view.
  • Figure 3: Example machine-code instructions for handling memory transfers in an ML model. The ISSUE command starts the DMA, while the WAIT command blocks until it is complete and is typically inserted close to the command that needs the memory. The compiler can hide the DMA latency by inserting instructions between ISSUE and WAIT.
  • Figure 4: Lifetime of a DMA. A DMA's latency consists of a base latency (constant) and a transfer latency (variable). DMAs begin with the ISSUE command. If the DMA has not completed by the time the WAIT command executes, the accelerator stalls until the DMA completes. We highlight three distinct scenarios. (1) WAIT comes before the base latency has elapsed. The DMA incurs stalls first because of the base latency (green) and then because of the transfer latency (purple). (2) WAIT comes after the base latency has elapsed but before the transfer latency has. The DMA incurs stalls only because of the transfer latency (purple). (3) WAIT comes after the DMA finishes. The time between the DMA completion and the WAIT is slack (gray cross pattern).
  • Figure 5: All-Gather DMA Pattern. Memory accesses are performed in two phases: the setup phase and the data transfer phase. Dependencies for the setup phase are shown with red arrows. These dependencies were discovered manually by reading the machine code, a laborious process.
  • ...and 5 more figures
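
The three scenarios in the Figure 4 caption can be made concrete with a small calculation. The sketch below is illustrative only and is not xPU-Shark's implementation: the names `DmaTiming` and `classify_dma`, the use of cycles as the time unit, and the assumption that the transfer latency simply follows the base latency are all hypothetical, chosen to mirror the caption.

```python
from dataclasses import dataclass


@dataclass
class DmaTiming:
    """Stall and slack breakdown for one DMA (hypothetical unit: cycles)."""
    base_stall: int      # stall attributed to the constant base latency
    transfer_stall: int  # stall attributed to the variable transfer latency
    slack: int           # idle time between DMA completion and WAIT


def classify_dma(issue: int, wait: int,
                 base_latency: int, transfer_latency: int) -> DmaTiming:
    """Split the stall seen at WAIT into base and transfer parts.

    Assumes the DMA completes at issue + base_latency + transfer_latency,
    with the base latency elapsing first, as in the caption.
    """
    base_done = issue + base_latency
    dma_done = base_done + transfer_latency

    if wait >= dma_done:
        # Scenario 3: the DMA finished before WAIT, so there is only slack.
        return DmaTiming(0, 0, wait - dma_done)

    # Scenario 1 has a non-zero base stall; scenario 2 has none.
    base_stall = max(0, base_done - wait)
    # The transfer stall starts at whichever comes later: WAIT or the
    # end of the base latency.
    transfer_stall = dma_done - max(wait, base_done)
    return DmaTiming(base_stall, transfer_stall, 0)


if __name__ == "__main__":
    # Made-up numbers: base latency 100 cycles, transfer latency 50 cycles.
    for wait_cycle in (30, 120, 200):
        print(wait_cycle, classify_dma(0, wait_cycle, 100, 50))
```

Under this model, moving WAIT later (or scheduling more independent instructions between ISSUE and WAIT) first removes the base-latency stall, then the transfer-latency stall, and any further delay shows up only as slack.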