MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

Stepan Vanecek, Manuel Walter Mussbacher, Dominik Groessler, Urvij Saroliya, Martin Schulz

TL;DR

MT4G tackles the lack of standardized GPU topology information by delivering an open-source, vendor-agnostic tool that auto-discovers compute and memory topologies for NVIDIA and AMD GPUs. It combines API data with a large microbenchmark suite and Kolmogorov-Smirnov–based change-point detection to reliably estimate topology attributes such as cache sizes, latencies, bandwidths, and fetch granularity. The approach is validated across ten GPUs and demonstrated in three workflows: GPU performance modeling, GPUscout bottleneck analysis, and dynamic resource partitioning with sys-sage, highlighting MT4G's practical impact on performance optimization and resource management in HPC/AI workloads. By providing automated, portable topology reports, MT4G enables more accurate modeling, tuning, and runtime configuration decisions across vendor platforms.

Abstract

Understanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific. In this work, we address this gap and present MT4G, an open-source and vendor-agnostic tool that automatically discovers GPU compute and memory topologies and configurations, including cache sizes, bandwidths, and physical layouts. MT4G combines existing APIs with a suite of over 50 microbenchmarks, applying statistical methods, such as the Kolmogorov-Smirnov test, to automatically and reliably identify otherwise programmatically unavailable topological attributes. We showcase MT4G's universality on ten different GPUs and demonstrate its impact through integration into three workflows: GPU performance modeling, GPUscout bottleneck analysis, and dynamic resource partitioning. These scenarios highlight MT4G's role in understanding system performance and characteristics across NVIDIA and AMD GPUs, providing an automated, portable solution for modern HPC and AI systems.
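As a rough illustration of how a Kolmogorov-Smirnov comparison can flag the array size at which load latencies change, the sketch below compares each size's latency sample against the sample from the smallest size. It is a minimal sketch under stated assumptions, not MT4G's implementation: the function names `ks_statistic` and `first_shift`, the threshold of 0.5, and the latency values are all hypothetical and chosen only for illustration.

```python
import bisect

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at sample
    # points, so checking the pooled sample points is sufficient.
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def first_shift(samples_by_size, threshold=0.5):
    """Return the first array size whose latency sample differs from the
    smallest size's sample by at least `threshold` (hypothetical cutoff)."""
    sizes = sorted(samples_by_size)
    baseline = samples_by_size[sizes[0]]
    for size in sizes[1:]:
        if ks_statistic(baseline, samples_by_size[size]) >= threshold:
            return size
    return None

# Made-up latencies in cycles, keyed by array size in KiB (illustrative only).
samples = {
    16: [30, 31, 29, 30, 32, 31],
    32: [31, 30, 30, 29, 32, 30],
    48: [30, 32, 31, 30, 29, 31],
    64: [30, 200, 198, 31, 205, 201],   # mix of hits and misses
    80: [202, 199, 204, 200, 198, 203],
}
print(first_shift(samples))
```

On this made-up data the first size flagged is the one where hits and misses are mixed, which is where a cache-capacity boundary would be expected to lie.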

Paper Structure

This paper contains 43 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Pointer-chase with different array sizes on a simplified 2-way cache depiction. If the array fits into the cache, the load causes a cache hit; otherwise, a cache miss occurs. Around the boundary, we may experience both hits and misses, as is the case in the middle example.
  • Figure 2: NVIDIA V100 CL1, AMD MI300 vL1 and AMD MI210 sL1d example sizes showing raw data analysis (blue, orange, green) and the reduction value (violet). The reduction value shows the change point (vertical dashed line) most clearly; the maximum is prone to outliers.
  • Figure 3: The core of the Amount benchmarks: the two cores evict each other's data if they fetch from the same segment; otherwise they do not. The top shows a case with two segments, the bottom a case with one.
  • Figure 4: GPUscout-GUI Memory Component visualization, containing memory element sizes. stuckenberger2025.
  • Figure 5: Streaming read throughput over arrays of varying size on NVIDIA A100 with different MIG settings. Vertical lines mark the L2 cache size provided by sys-sage vanecek2024sys.
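The hit/miss behavior described in the Figure 1 caption can be reproduced with a small software model. The following is a minimal sketch, assuming a toy LRU set-associative cache with 4 sets and 2 ways (8 lines in total) and a cyclic, sequential pointer chain; the function name `pointer_chase_hit_rate` and all parameters are hypothetical and do not correspond to MT4G code or to any real GPU cache.

```python
from collections import OrderedDict

def pointer_chase_hit_rate(array_lines, num_sets=4, ways=2, rounds=4):
    """Chase a cyclic chain over `array_lines` cache lines through a toy
    LRU set-associative cache; return the hit rate after one warm-up round."""
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = accesses = 0
    for r in range(rounds):
        for line in range(array_lines):
            s = sets[line % num_sets]      # set index from the line address
            hit = line in s
            if hit:
                s.move_to_end(line)        # mark as most recently used
            else:
                if len(s) >= ways:
                    s.popitem(last=False)  # evict the least recently used line
                s[line] = True
            if r > 0:                      # round 0 only warms the cache
                accesses += 1
                hits += hit
    return hits / accesses

# Cache capacity is 4 sets * 2 ways = 8 lines.
for n in (4, 8, 10, 16):
    print(n, pointer_chase_hit_rate(n))
```

In this model, arrays of up to 8 lines hit on every access after warm-up, an array of 16 lines misses on every access, and an array of 10 lines produces both hits and misses because only some sets are oversubscribed. This corresponds to the boundary case in the middle example of the figure.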