Table of Contents
Fetching ...

XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale

Weihao Cui, Ji Zhang, Han Zhao, Chao Liu, Wenhao Zhang, Jian Sha, Quan Chen, Bingsheng He, Minyi Guo

TL;DR

XPUTimer tackles the challenging problem of diagnosing anomalies in large-scale distributed LLM training by introducing a real-time, holistic framework that combines lightweight per-process tracing with a fast diagnostic engine. It selectively instruments critical Python APIs and GPU kernels and uses intra-kernel tracing alongside aggregated metrics like issue latency distribution and void percentage to identify both obvious and obscured slowdowns, as well as runtime errors, all in a backbone-agnostic and hardware-extensible manner. The framework demonstrates ultra-low overhead (e.g., around $0.43\%$ latency) and compact trace logs ($\approx1.5\text{MB}$ per GPU) in extensive evaluations and is deployed across over $6{,}000$ GPUs for eight months in Ant Group, yielding actionable insights and case studies for algorithm, infrastructure, and operations teams. Overall, XPUTimer enables real-time, cross-stack attribution of anomalies, reduces diagnostic complexity, and is open-sourced to promote broader adoption and extension to other NPUs and backbones.

Abstract

The rapid proliferation of large language models has driven the need for efficient GPU training clusters. However, ensuring high-performance training in these clusters is challenging due to the complexity of software-hardware interactions and the frequent occurrence of training anomalies. Since existing diagnostic tools are narrowly tailored to specific issues, there are gaps in their ability to address anomalies spanning the entire training stack. In response, we introduce XPUTimer, a real-time diagnostic framework designed for distributed LLM training at scale. XPUTimer first integrates a lightweight tracing daemon to monitor key code segments with minimal overhead. Additionally, it features a diagnostic engine that employs novel intra-kernel tracing and holistic aggregated metrics to efficiently identify and resolve anomalies. Deployment of XPUTimer across 6,000 GPUs over eight months demonstrated significant improvements across the training stack, validating its effectiveness in real-world scenarios.

XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale

TL;DR

XPUTimer tackles the challenging problem of diagnosing anomalies in large-scale distributed LLM training by introducing a real-time, holistic framework that combines lightweight per-process tracing with a fast diagnostic engine. It selectively instruments critical Python APIs and GPU kernels and uses intra-kernel tracing alongside aggregated metrics like issue latency distribution and void percentage to identify both obvious and obscured slowdowns, as well as runtime errors, all in a backbone-agnostic and hardware-extensible manner. The framework demonstrates ultra-low overhead (e.g., around latency) and compact trace logs ( per GPU) in extensive evaluations and is deployed across over GPUs for eight months in Ant Group, yielding actionable insights and case studies for algorithm, infrastructure, and operations teams. Overall, XPUTimer enables real-time, cross-stack attribution of anomalies, reduces diagnostic complexity, and is open-sourced to promote broader adoption and extension to other NPUs and backbones.

Abstract

The rapid proliferation of large language models has driven the need for efficient GPU training clusters. However, ensuring high-performance training in these clusters is challenging due to the complexity of software-hardware interactions and the frequent occurrence of training anomalies. Since existing diagnostic tools are narrowly tailored to specific issues, there are gaps in their ability to address anomalies spanning the entire training stack. In response, we introduce XPUTimer, a real-time diagnostic framework designed for distributed LLM training at scale. XPUTimer first integrates a lightweight tracing daemon to monitor key code segments with minimal overhead. Additionally, it features a diagnostic engine that employs novel intra-kernel tracing and holistic aggregated metrics to efficiently identify and resolve anomalies. Deployment of XPUTimer across 6,000 GPUs over eight months demonstrated significant improvements across the training stack, validating its effectiveness in real-world scenarios.

Paper Structure

This paper contains 49 sections, 2 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: The summarized training stack of large-scale training cluster in Ant Group, highlighting XPUTimer’s position.
  • Figure 2: Slowdown of a 1024-GPU training job with Llama2-70B under various single-GPU underclocking configurations. Results are evaluated using both Megatron shoeybiMegatronLMTraining and FSDP zhaoPyTorchFSDP.
  • Figure 3: Architecture overview of XPUTimer.
  • Figure 4: Instrumented key code segments in XPUTimer.
  • Figure 5: Intercepting and timing the training in the background.
  • ...and 8 more figures