m4: A Learned Flow-level Network Simulator

Chenning Li, Anton A. Zabreyko, Arash Nasr-Esfahany, Kevin Zhao, Prateesh Goyal, Mohammad Alizadeh, Thomas Anderson

TL;DR

Flow-level simulators scale to data-center networks but sacrifice accuracy by omitting packet-level effects. m4 closes this gap with a learned flow-level model that decomposes state transitions into spatial (Graph Neural Network) and temporal (GRU) components. To capture congestion control and queuing dynamics, training is augmented with dense per-event supervision signals such as remaining flow size and queue length. Trained on ns-3 data and evaluated on large-scale fat-tree topologies, m4 achieves speedups of up to 104× over packet-level simulation and reduces per-flow FCT slowdown errors by 45.3% (mean) and 53.0% (p90) relative to traditional flow-level models, while generalizing across workloads and congestion control (CC) schemes. This enables accurate, scalable, closed-loop network simulation for design-space exploration and application-level performance forecasting, with practical relevance to data-center network design and ML training/inference workloads.

Abstract

Flow-level simulation is widely used to model large-scale data center networks due to its scalability. Unlike packet-level simulators that model individual packets, flow-level simulators abstract traffic as continuous flows with dynamically assigned transmission rates. While this abstraction enables orders-of-magnitude speedup, it sacrifices accuracy by omitting critical packet-level effects such as queuing, congestion control, and retransmissions. We present m4, an accurate and scalable flow-level simulator that uses machine learning to learn the dynamics of the network of interest. At the core of m4 lies a novel ML architecture that decomposes state transition computations into distinct spatial and temporal components, each represented by a suitable neural network. To efficiently learn the underlying flow-level dynamics, m4 adds dense supervision signals by predicting intermediate network metrics such as remaining flow size and queue length during training. m4 achieves a speedup of up to 104× over packet-level simulation. Relative to a traditional flow-level simulation, m4 reduces per-flow estimation errors by 45.3% (mean) and 53.0% (p90). For closed-loop applications, m4 accurately predicts network throughput under various congestion control schemes and workloads.
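To make "continuous flows with dynamically assigned transmission rates" concrete, the following is a minimal sketch of how a traditional flow-level simulator might assign rates. It assumes max-min fair sharing computed by progressive filling, which is a common choice but is not stated in the abstract. The function name `max_min_rates`, the link names, and the capacities are hypothetical, and this is not m4's learned model.

```python
from typing import Dict, List


def max_min_rates(paths: Dict[str, List[str]],
                  capacity: Dict[str, float]) -> Dict[str, float]:
    """Assign max-min fair rates by progressive filling.

    paths maps each flow to the (non-empty) list of links it traverses;
    capacity maps each link to its bandwidth. Both are illustrative inputs.
    """
    remaining = dict(capacity)                    # unallocated capacity per link
    active = {f: list(p) for f, p in paths.items()}  # flows without a rate yet
    rates: Dict[str, float] = {}

    while active:
        # Count how many still-active flows cross each link.
        load: Dict[str, int] = {}
        for links in active.values():
            for link in links:
                load[link] = load.get(link, 0) + 1

        # The bottleneck is the link offering the smallest equal share.
        bottleneck = min(load, key=lambda l: remaining[l] / load[l])
        share = remaining[bottleneck] / load[bottleneck]

        # Freeze every flow on the bottleneck at that share and
        # subtract its rate from all links on its path.
        for flow in [f for f, links in active.items() if bottleneck in links]:
            rates[flow] = share
            for link in active[flow]:
                remaining[link] -= share
            del active[flow]

    return rates


# Hypothetical snapshot: three flows over two links.
example_rates = max_min_rates(
    {"f1": ["A"], "f2": ["A", "B"], "f3": ["B"]},
    {"A": 10.0, "B": 4.0},
)
```

In a flow-level simulator of this kind, rates would be recomputed at each flow arrival or departure. Because the computation sees only link capacities and paths, it cannot express queuing, congestion control, or retransmissions, which are the effects the abstract says are lost.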

Paper Structure

This paper contains 23 sections, 6 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: m4 mimics the computational structure of flowSim but replaces its components with learnable modules.
  • Figure 2: m4’s workflow: Inputs (yellow boxes), outputs (red boxes), intermediate components (white boxes).
  • Figure 3: m4 adds "dense" supervision during training by querying intermediate network states for "remaining size" and "queue length". Dashed boxes represent subsequent simulations triggered by new flow-level events.
  • Figure 4: m4 converts (a) a network snapshot in time to a (b) bipartite graph and uses GNN to capture spatial dynamics.
  • Figure 5: m4's implementation.
  • ...and 8 more figures
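
The conversion named in the Figure 4 caption, from a network snapshot to a bipartite graph, can be sketched as below. This is only an illustration: it assumes the two node sets are the active flows and the links they traverse, with one edge per (flow, link) pair, which the caption does not spell out. The function name `snapshot_to_bipartite` and the edge-list layout are hypothetical.

```python
from typing import Dict, List, Tuple


def snapshot_to_bipartite(
    paths: Dict[str, List[str]],
) -> Tuple[List[str], List[str], List[Tuple[int, int]]]:
    """Build a flow-link bipartite graph from a snapshot of active flows.

    paths maps each active flow to the links on its route (assumed input).
    Returns the flow nodes, the link nodes, and edges as (flow_idx, link_idx).
    """
    flow_ids = sorted(paths)
    link_ids = sorted({link for p in paths.values() for link in p})

    flow_index = {f: i for i, f in enumerate(flow_ids)}
    link_index = {l: i for i, l in enumerate(link_ids)}

    # One edge for every link a flow traverses.
    edges = [
        (flow_index[f], link_index[link])
        for f in flow_ids
        for link in paths[f]
    ]
    return flow_ids, link_ids, edges


# Hypothetical snapshot with two flows sharing link "A".
flows, links, edges = snapshot_to_bipartite({"f1": ["A"], "f2": ["A", "B"]})
```

An edge list of this shape is the usual input to message-passing GNN layers, where flow nodes and link nodes would exchange features along the edges. The features themselves are not described in this section and are left out.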