Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

Amel Fatima, Tuan Ta, Bradford M. Beckmann

Abstract

Distributed ML workloads rely heavily on collective communication across multi-GPU, multi-node systems. Emerging scale-up fabrics, such as NVLink and UALink, enable direct memory access across nodes but introduce a critical destination-side translation step: translating Network Physical Addresses (NPAs) to System Physical Addresses (SPAs), which we term Reverse Address Translation. Despite its importance, the performance impact of Reverse Address Translation remains poorly understood. In this work, we present the first systematic study of Reverse Address Translation in large-scale GPU clusters. Using an extended ASTRA-sim framework with OMNeT++ as the network backend, we model Link MMUs and Link TLBs and evaluate their effect on All-to-All collective communication across varying input sizes and GPU counts. Our analysis shows that cold TLB misses dominate latency for small, latency-sensitive collectives, causing up to 1.4x performance degradation, while larger collectives benefit from warmed caches and experience diminishing returns from oversized TLBs. Based on these observations, we propose two avenues for optimization: fused pre-translation kernels that overlap Reverse Address Translation with computation, and software-guided TLB prefetching that proactively populates likely-needed entries. These techniques aim to hide translation latency, particularly for small collectives, improving throughput and scalability for inference workloads. Our study establishes a foundation for designing efficient destination-side translation mechanisms in large-scale multi-GPU systems.

Paper Structure

This paper contains 17 sections, 11 figures, and 1 table.

Figures (11)

  • Figure 1: A multi-node, multi-GPU system connected over a UALink network (for clarity, only a subset of the UALink links is shown; the remaining links are omitted).
  • Figure 2: Reverse Address Translation of a Network Physical Address (NPA) to a System Physical Address (SPA) at the target GPU for inter-node accesses.
  • Figure 3: Our baseline Reverse Address Translation hierarchy for performing Reverse Address Translation at the Target GPU node.
  • Figure 4: Performance overhead of Reverse Address Translation, normalized to an ideal configuration with zero Reverse Address Translation overhead, evaluated on systems with 8 to 64 GPUs and All-to-All collective sizes ranging from 1 MB to 4 GB.
  • Figure 5: Average Reverse Address Translation latency per request, evaluated on systems with 8 to 64 GPUs and All-to-All collective sizes ranging from 1 MB to 4 GB.
  • ...and 6 more figures