ucTrace: A Multi-Layer Profiling Tool for UCX-driven Communication
Emir Gencer, Mohammad Kefah Taha Issa, Ilyas Turimbetov, James D. Trotter, Didem Unat
TL;DR
ucTrace addresses the need for fine-grained UCX-level profiling across CPU-GPU HPC systems by providing multi-layer attribution from UCX transport events to MPI calls and device traffic. The approach instruments UCX at the UCT/UCP layers, attributes communications to MPI and GPU devices via NVIDIA Compute Sanitizer, and offers an interactive visualization suite including timelines, communication matrices, and device graphs. Demonstrations across UCX eager/rendezvous protocols, Open MPI and MPICH Allreduce, a conjugate gradient solver, and GROMACS MD simulations reveal transport usage, topology-dependent behavior, NUMA effects, and potential bottlenecks. Although it incurs non-trivial runtime overhead, ucTrace delivers actionable insights for administrators and library developers and is designed to be extended to additional libraries and GPU vendors.
Abstract
UCX is a communication framework that enables low-latency, high-bandwidth communication in HPC systems. With its unified API, UCX facilitates efficient data transfers across multi-node CPU-GPU clusters. UCX is widely used as the transport layer for MPI, particularly in GPU-aware implementations. However, existing profiling tools lack fine-grained communication traces at the UCX level, do not capture transport-layer behavior, or are limited to specific MPI implementations. To address these gaps, we introduce ucTrace, a novel profiler that exposes and visualizes UCX-driven communication in HPC environments. ucTrace provides insights into MPI workflows by profiling message passing at the UCX level, linking operations between hosts and devices (e.g., GPUs and NICs) directly to their originating MPI functions. Through interactive visualizations of process- and device-specific interactions, ucTrace helps system administrators, library developers, and application developers optimize performance and debug communication patterns in large-scale workloads. We demonstrate ucTrace's features through a wide range of experiments, including MPI point-to-point behavior under different UCX settings, Allreduce comparisons across MPI libraries, communication analysis of a linear solver, NUMA binding effects, and profiling of GROMACS MD simulations with GPU acceleration at scale. ucTrace is publicly available at https://github.com/ParCoreLab/ucTrace.
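To make the idea of multi-layer attribution concrete, the following is a minimal sketch of how transport-level events might be aggregated into a communication matrix and grouped by originating MPI call. The record layout, the sample values, and the function names `communication_matrix` and `bytes_by_mpi_call` are assumptions made for illustration; they do not reflect ucTrace's actual trace format or API.

```python
from collections import defaultdict

# Hypothetical trace records, one per transport-level send:
# (source rank, destination rank, bytes, UCT transport, originating MPI call).
# This layout is an assumption for illustration, not ucTrace's output format.
EVENTS = [
    (0, 1, 4096, "rc_mlx5", "MPI_Send"),
    (0, 1, 1 << 20, "cuda_ipc", "MPI_Allreduce"),
    (1, 0, 4096, "rc_mlx5", "MPI_Send"),
    (2, 3, 8192, "posix", "MPI_Allreduce"),
]


def communication_matrix(events, num_ranks):
    """Sum bytes sent from each source rank (row) to each destination rank (column)."""
    matrix = [[0] * num_ranks for _ in range(num_ranks)]
    for src, dst, nbytes, _transport, _mpi_call in events:
        matrix[src][dst] += nbytes
    return matrix


def bytes_by_mpi_call(events):
    """Attribute transport-level traffic to (MPI call, transport) pairs."""
    totals = defaultdict(int)
    for _src, _dst, nbytes, transport, mpi_call in events:
        totals[(mpi_call, transport)] += nbytes
    return dict(totals)


if __name__ == "__main__":
    for row in communication_matrix(EVENTS, num_ranks=4):
        print(row)
    for key, total in sorted(bytes_by_mpi_call(EVENTS).items()):
        print(key, total)
```

Keeping the MPI call and the transport on every record is what allows the same data to be viewed either per process pair or per originating function, which is the kind of cross-layer view the abstract describes.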
