
SysOM-AI: Continuous Cross-Layer Performance Diagnosis for Production AI Training

Yusheng Zheng, Wenan Mao, Shuyi Cheng, Fuqiu Feng, Guangshui Li, Zhaoyan Liao, Yongzhuo Huang, Zhenwei Xiao, Yuqing Li, Andi Quinn, Tao Ma

Abstract

Performance diagnosis in production-scale AI training is challenging because subtle OS-level issues can trigger cascading GPU delays and network slowdowns, degrading training efficiency across thousands of GPUs. Existing profiling tools are limited to single system layers, incur prohibitive overhead (10--30%), or lack continuous deployment capabilities, resulting in manual analyses spanning days. We argue that continuous, cross-layer observability enabled by OS-level instrumentation and layered differential diagnosis is necessary to address this gap. We introduce SysOM-AI, a production observability system that continuously integrates CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via adaptive hybrid stack unwinding and eBPF-based tracing, incurring less than 0.4% overhead. Deployed at Alibaba across over 80,000 GPUs for more than one year, SysOM-AI helped diagnose 94 confirmed production issues, reducing median diagnosis time from days to approximately 10 minutes.

Paper Structure

This paper contains 24 sections, 8 figures, 2 tables, and 1 algorithm.

Figures (8)

  • Figure 1: SysOM-AI architecture. Each node runs an agent that collects CPU stacks (hrtimer + eBPF), GPU kernel timings (CUDA uprobe), and NCCL collective events (uprobe). Stacks are unwound via adaptive hybrid FP/DWARF, stitched across Python and C++ frames, aggregated in-kernel, and uploaded to a central service for symbol resolution, slow-rank detection, and layered differential diagnosis.
  • Figure 2: Distribution of 2,649 diagnostic events by root-cause category during the six-month evaluation period.
  • Figure 3: Stack unwinding frame accuracy on production AI workloads. FP-only achieves only 5% due to widespread omission of frame pointers. SysOM-AI's hybrid unwinding with centralized symbol resolution achieves 95%.
  • Figure 4: Symbol misattribution from node-side resolution. The function pangu_memcpy_avx512 absorbs samples from many unrelated functions due to sparse symbol entries, producing a fictitious hot spot. Centralized resolution with the full symbol table correctly attributes these addresses.
  • Figure 5: Straggler detection for Case 1. Per-rank NCCL collective entry times show rank 0 entering last, indicating it is the straggler in this 8-rank communication group.
  • ...and 3 more figures
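The straggler detection described for Figure 5 (identifying the rank that enters a collective last) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, timestamp units, and lag threshold are assumptions for the example.

```python
# Hedged sketch of slow-rank (straggler) detection from per-rank NCCL
# collective entry times, in the spirit of Figure 5. Names and the
# threshold value are illustrative, not taken from the paper.
from statistics import median

def find_straggler(entry_times_us, lag_threshold_us=1000):
    """entry_times_us: {rank: collective entry timestamp in microseconds}.

    Returns the last-entering rank if it lags the group's median entry
    time by more than lag_threshold_us, else None.
    """
    med = median(entry_times_us.values())
    rank, t = max(entry_times_us.items(), key=lambda kv: kv[1])
    return rank if t - med > lag_threshold_us else None

# Example mirroring Figure 5: rank 0 enters ~5 ms after its peers
# in an 8-rank communication group, so it is flagged as the straggler.
times = {0: 5200, 1: 100, 2: 120, 3: 90, 4: 110, 5: 95, 6: 105, 7: 130}
print(find_straggler(times))  # → 0
```

Comparing against the group median (rather than the minimum) keeps the check robust to a single fast outlier, which matters when entry times are noisy across thousands of ranks.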