SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems

Xin Wang, Pietro Lodi Rizzini, Sourav Medya, Zhiling Lan

TL;DR

Dragonfly interconnect workloads suffer dynamic interference, making high-fidelity PDES expensive. SMART combines a GNN encoder for spatial topology, a temporal transformer for time dynamics, and a Time-LLM forecast module to predict next-iteration runtimes, with online tuning to adapt to changing traffic. On data from a 1,056-node Dragonfly system, SMART outperforms all baselines in forecasting accuracy and achieves a mean inference time of about $0.515$ seconds, enabling real-time or near-real-time hybrid simulations. This surrogate model significantly reduces simulation time while preserving accuracy, improving decision-making for routing, congestion control, and resource allocation in Dragonfly-based HPC systems.

Abstract

The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present SMART, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port-level router data. SMART outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.

Paper Structure

This paper contains 25 sections, 6 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Topology of a 1,056-node 1D Dragonfly network.
  • Figure 2: Hybrid PDES-surrogate simulation.
  • Figure 3: The architecture of our proposed model, SMART. The GNN Encoder (top) generates the node embeddings, while the Transformer produces node representations over time. The lower part shows the LLM-based component. The outputs of the components are concatenated node-wise and fed into a Linear layer for the final prediction.
  • Figure 4: Effect of different hyper-parameters on MAPE for the different datasets and node assignment combinations: (a) path length, (b) the number of GCN layers, and (c) the number of LLM hidden layers. The variations are not significant: the MAPE varies between 3.1 and 3.8, whereas the best baseline has a MAPE of 6.09.
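
The Figure 4 caption reports accuracy as MAPE (mean absolute percentage error). As a reading aid, here is a minimal sketch of the standard MAPE definition. It assumes the paper uses this usual form, expressed in percent; the function name `mape` and the runtimes in the usage note are hypothetical and are not taken from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (standard definition).

    Assumption: this is the usual textbook form; the source does not
    spell out its exact formula.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")

    # Average the relative error of each prediction, then scale to percent.
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)
```

For example, with hypothetical per-iteration runtimes `[10.0, 20.0, 40.0]` and predictions `[9.0, 22.0, 40.0]`, the relative errors are 10%, 10%, and 0%, so `mape` returns roughly 6.67. On this scale, the caption's range of 3.1 to 3.8 corresponds to predictions that are off by under 4% of the true runtime on average.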