BandPilot: Towards Performance- and Contention-Aware GPU Dispatching in AI Clusters

Kunming Zhang, Hanlong Liao, Junyu Xue, Deke Guo, Guoming Tang

TL;DR

BandPilot tackles the problem of selecting concrete GPU subsets in multi-tenant AI clusters where interconnect contention and heterogeneous fabrics sharply limit end-to-end NCCL bandwidth. It introduces a hierarchical bandwidth surrogate built on sparse NCCL measurements, a contention-aware predictor that estimates B(S,τ) under current cluster load, and a fast hybrid search that navigates the combinatorial allocation space. The key contributions are the problem formulation for contention-aware GPU dispatching, the data-efficient hierarchical Transformer surrogate, the contention predictor that accounts for co-located traffic, and the hybrid search algorithm that yields near-optimal allocations in real time. Empirically, BandPilot achieves 92–97% of the best-found bandwidth and improves average efficiency by about 20–40% over topology-compactness baselines on a 32-GPU H100 cluster and in heterogeneous simulations.

Abstract

Modern multi-tenant AI clusters are increasingly communication-bound, driven by high-volume and multi-round GPU-to-GPU collective communication. Consequently, the GPU dispatcher's choice of a physical GPU subset for each tenant largely determines the job's effective collective bandwidth and thus its performance ceiling. Existing dispatchers predominantly rely on static, topology-aware heuristics that prioritize GPU resource compactness, assuming that minimizing physical distance maximizes communication bandwidth. However, we reveal that this assumption often fails due to complex system-level bottlenecks, such as non-linear NIC saturation and inter-node link heterogeneity. This paper presents BandPilot, a performance- and contention-aware GPU dispatching primitive that optimizes effective collective bandwidth for multi-tenant AI clusters. Specifically, BandPilot learns a data-efficient bandwidth model from sparse NCCL measurements via a hierarchical design. Guided by the model, a fast hybrid search combines an equilibrium-driven constructor with a pruned elimination search to navigate the combinatorial allocation space in real time. To account for multi-tenant interference, BandPilot virtually merges a candidate allocation with co-located cross-host jobs to conservatively estimate shared bottleneck capacity and predict contention-degraded bandwidth. Across a 32-GPU H100 cluster and heterogeneous simulations, BandPilot achieves 92–97% bandwidth efficiency relative to the best-found reference, improving average efficiency by 20–40% over topology-compactness heuristics.
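To make the idea of model-guided dispatching concrete, here is a minimal Python sketch. It is not BandPilot's implementation. The scoring function `toy_bandwidth` is a hypothetical stand-in for the learned surrogate: it assumes one NIC per GPU and that the host contributing the fewest GPUs bottlenecks cross-host traffic. The bandwidth constants are made up for illustration. The exhaustive `best_split` enumeration is a stand-in for the hybrid search, which exists so that the allocation space need not be enumerated.

```python
from itertools import product


def toy_bandwidth(split, intra_bw=400.0, nic_bw=50.0):
    """Hypothetical bandwidth score for a per-host GPU split (toy model only).

    Assumption: each GPU has its own NIC, so cross-host bandwidth is capped
    by the host that contributes the fewest GPUs. Constants are illustrative.
    """
    used = [g for g in split if g > 0]
    if len(used) <= 1:
        # Single-host allocation: only intra-host bandwidth matters.
        return intra_bw
    return min(intra_bw, min(used) * nic_bw)


def best_split(free, k, score=toy_bandwidth):
    """Pick the per-host split of k GPUs that maximizes the score.

    free[i] is the number of idle GPUs on host i. Brute-force enumeration is
    used here only because the example is tiny.
    """
    candidates = (
        s for s in product(*(range(f + 1) for f in free)) if sum(s) == k
    )
    return max(candidates, key=score, default=None)


if __name__ == "__main__":
    # Two hosts with 6 idle GPUs each, and an 8-GPU request.
    print(best_split([6, 6], 8))
    print(toy_bandwidth((4, 4)), toy_bandwidth((6, 2)))
```

Under these toy assumptions the balanced split scores higher than the unbalanced one, which is the qualitative behaviour the abstract attributes to NIC saturation. The real surrogate is learned from NCCL measurements and need not follow this simple rule.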

Paper Structure

This paper contains 45 sections, 21 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Effective NCCL (NVIDIA Collective Communications Library) all-reduce bandwidth on a real-world H100 cluster. For an 8-GPU request, a balanced 4+4 allocation across two nodes delivers over $2.2\times$ the bandwidth of an unbalanced 6+2 allocation favored by existing topology-aware dispatchers.
  • Figure 2: Peer-to-peer bandwidth measurements on 8 NVIDIA RTX 4090 GPUs. The P2P bandwidth between proximal GPUs (e.g., GPU 0 and GPU 1) can be lower than that between more remote pairs (e.g., GPU 0 and GPU 7), illustrating anti-locality effects.
  • Figure 3: Architectural overview of the BandPilot system. The left path depicts a conventional scheduler, whose topology-driven decisions can result in unbalanced allocations and suboptimal bandwidth. The right main workflow illustrates BandPilot's closed-loop, data-driven approach, which identifies near-optimal, balanced allocations.
  • Figure 4: The Hierarchical Prediction Strategy. Instead of a single monolithic model, we use precise, measured lookups for intra-host bandwidth and a lightweight Transformer to predict the complex inter-host communication dynamics, significantly reducing the learning burden.
  • Figure 5: Data efficiency and predictive accuracy of the Hierarchical Transformer surrogate model. The model achieves high accuracy ($R^2 > 0.95$ and $\mathrm{MAPE} < 5\%$) across two cluster types with only 250 training samples. This confirms the model's ability to generalize from sparse data.
  • ...and 4 more figures
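
The two accuracy metrics quoted in the Figure 5 caption, $R^2$ and MAPE, have standard definitions. The sketch below computes both for a list of measured and predicted bandwidths. The function names and the sample values are illustrative assumptions, not data from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


if __name__ == "__main__":
    # Made-up bandwidth values (GB/s) purely for illustration.
    measured = [100.0, 200.0, 300.0, 400.0]
    predicted = [110.0, 190.0, 310.0, 390.0]
    print(f"R^2  = {r2_score(measured, predicted):.3f}")
    print(f"MAPE = {mape(measured, predicted):.2f}%")
```

MAPE weights errors relative to the measured value, so a fixed absolute error counts for more on low-bandwidth allocations than on high-bandwidth ones.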