BandPilot: Towards Performance- and Contention-Aware GPU Dispatching in AI Clusters
Kunming Zhang, Hanlong Liao, Junyu Xue, Deke Guo, Guoming Tang
TL;DR
BandPilot addresses the selection of concrete GPU subsets in multi-tenant AI clusters, where interconnect contention and heterogeneous fabrics sharply limit end-to-end NCCL bandwidth. It combines three components: a hierarchical bandwidth surrogate trained on sparse NCCL measurements, a contention-aware predictor that estimates B(S,τ) under the current cluster load, and a fast hybrid search over the combinatorial allocation space. The key contributions are the problem formulation for contention-aware GPU dispatching, the data-efficient hierarchical Transformer surrogate, the contention predictor that accounts for co-located traffic, and the hybrid search algorithm that yields near-optimal allocations in real time. Empirically, on a 32-GPU H100 cluster and in heterogeneous simulations, BandPilot achieves 92–97% of the best-found bandwidth and improves average efficiency by about 20–40% over topology-compactness baselines.
Abstract
Modern multi-tenant AI clusters are increasingly communication-bound, driven by high-volume and multi-round GPU-to-GPU collective communication. Consequently, the GPU dispatcher's choice of a physical GPU subset for each tenant largely determines the job's effective collective bandwidth and thus its performance ceiling. Existing dispatchers predominantly rely on static, topology-aware heuristics that prioritize GPU resource compactness, assuming that minimizing physical distance maximizes communication bandwidth. However, we reveal that this assumption often fails due to complex system-level bottlenecks, such as non-linear NIC saturation and inter-node link heterogeneity. This paper presents BandPilot, a performance- and contention-aware GPU dispatching primitive that optimizes effective collective bandwidth for multi-tenant AI clusters. Specifically, BandPilot learns a data-efficient bandwidth model from sparse NCCL measurements via a hierarchical design. Guided by the model, a fast hybrid search combines an equilibrium-driven constructor with a pruned elimination search to navigate the combinatorial allocation space in real time. To account for multi-tenant interference, BandPilot virtually merges a candidate allocation with co-located cross-host jobs to conservatively estimate shared bottleneck capacity and predict contention-degraded bandwidth. Across a 32-GPU H100 cluster and heterogeneous simulations, BandPilot achieves 92-97% bandwidth efficiency relative to the best-found reference, improving average efficiency by 20-40% over topology-compactness heuristics.
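The virtual-merge idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the GPU encoding as (host, index) pairs, the function names, the toy bandwidth model, and the choice to take the minimum of the isolated and merged predictions are all assumptions made here for illustration. The toy model stands in for the learned surrogate and simply divides a hypothetical cross-host capacity by the GPU count on the busiest host.

```python
from collections import Counter
from typing import Callable, FrozenSet, Iterable, Set, Tuple

GPU = Tuple[int, int]  # (host_id, local_gpu_index); an assumed encoding


def hosts_of(gpus: Iterable[GPU]) -> Set[int]:
    """Return the set of hosts touched by a GPU subset."""
    return {host for host, _ in gpus}


def is_cross_host(gpus: Iterable[GPU]) -> bool:
    """A subset is cross-host if it spans more than one host."""
    return len(hosts_of(gpus)) > 1


def toy_surrogate(gpus: FrozenSet[GPU]) -> float:
    """Hypothetical stand-in for the learned bandwidth model (GB/s).

    Single-host subsets get a fixed intra-host figure; cross-host subsets
    share an assumed 200 GB/s capacity among the GPUs on the busiest host.
    """
    if not is_cross_host(gpus):
        return 400.0
    per_host = Counter(host for host, _ in gpus)
    return 200.0 / max(per_host.values())


def contention_aware_bandwidth(
    candidate: Iterable[GPU],
    active_jobs: Iterable[Iterable[GPU]],
    predict: Callable[[FrozenSet[GPU]], float],
) -> float:
    """Conservatively estimate bandwidth for `candidate` under current load.

    Assumed rule: merge the candidate with every running cross-host job
    that shares a host with it, predict on the merged set, and never
    report more than the isolated prediction.
    """
    cand = frozenset(candidate)
    isolated = predict(cand)
    if not is_cross_host(cand):
        return isolated  # no shared inter-host bottleneck in this toy model

    merged = set(cand)
    cand_hosts = hosts_of(cand)
    for job in active_jobs:
        job_set = frozenset(job)
        if is_cross_host(job_set) and hosts_of(job_set) & cand_hosts:
            merged |= job_set  # virtual merge with a co-located job

    if merged == set(cand):
        return isolated
    return min(isolated, predict(frozenset(merged)))


if __name__ == "__main__":
    running = [{(0, 1), (0, 2), (1, 1)}]  # one cross-host job on hosts 0 and 1
    print(contention_aware_bandwidth({(0, 0), (1, 0)}, running, toy_surrogate))
    print(contention_aware_bandwidth({(2, 0), (3, 0)}, running, toy_surrogate))
```

In this sketch a candidate that shares hosts with the running cross-host job is scored on the merged set, while a candidate on otherwise idle hosts keeps its isolated prediction, which is the behaviour a dispatcher would need to prefer the less contended placement.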
