
Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms

Andreas Boltres, Niklas Freymuth, Benjamin Schichtholz, Michael König, Gerhard Neumann

Abstract

Routing algorithms are crucial for efficient computer network operations, and in many settings they must be able to react to traffic bursts within milliseconds. Live telemetry data can provide informative signals to routing algorithms, and recent work has trained neural networks to exploit such signals for traffic-aware routing. Yet, aggregating network-wide information is subject to communication delays, and existing neural approaches either assume unrealistic delay-free global states, or restrict routers to purely local telemetry. This leaves their deployability in real-world environments unclear. We cast telemetry-aware routing as a delay-aware closed-loop control problem and introduce a framework that trains and evaluates neural routing algorithms, while explicitly modeling communication and inference delays. On top of this framework, we propose LOGGIA, a scalable graph neural routing algorithm that predicts log-space link weights from attributed topology-and-telemetry graphs. It utilizes a data-driven pre-training stage, followed by on-policy Reinforcement Learning. Across synthetic and real network topologies, and unseen mixed TCP/UDP traffic sequences, LOGGIA consistently outperforms shortest-path baselines, whereas neural baselines fail once realistic delays are enforced. Our experiments further suggest that neural routing algorithms like LOGGIA perform best when deployed fully locally, i.e., observing network states and inferring actions at every router individually, as opposed to centralized decision making.

Paper Structure

This paper contains 36 sections, 3 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: We view telemetry-aware near-real-time routing as a closed-loop control problem, illustrated here with distributed observers and routing agents. The network environment (center) operates on an externally provided network topology (upper right) and evolves a network state $S_t$, given populated forwarding tables $\mathbf{a}_t$ and upcoming traffic arrivals $\mathbf{z}_t$. During operation, telemetry information is used to form local network snapshots (lower right, shown for node 3 and incident links). The new state $S_{t+1}$ is the union of all local snapshots, and observers obtain localized views $\mathbf{O}_{t}$ thereof (lower left, shown for nodes 2 and 4). Routing policies $\pi$, which can include neural networks or conventional algorithms, calculate new routing decisions $\mathbf{\hat{a}}_{t}$ from the provided observations (upper left), which are then used to update the forwarding tables.
  • Figure 2: Observation graph assembly for the mini5 topology, shown in a) with link delays in ms. b): The "birds-eye" view of network state $S_t$ contains all current node and edge states $\mathbf{x}_t$. c): We designate node $1$ as the central node $v_c$, having the lowest maximum delay of $8$ ms to all other nodes. Colored edges show the minimum-delay spanning tree used for communicating state, observation, and action information. d): Observation graph $O_{1,t}$ of node $1$ at time $t$. For a step granularity of $\tau = 5$ ms, node states $\mathbf{x}_{0,t}$ and $\mathbf{x}_{2,t}$ will be available to node $1$ at time $t+1$; $\mathbf{x}_{3,t}$ and $\mathbf{x}_{4,t}$ will be available at time $t+2$.
  • Figure 3: The different possible deployment options within our framework. Each interaction step requires local routing preferences $\hat{a}_{v,t}$ to be installed at the respective routers. Left: Observer deployments. The Birdseye observer has access to the latest node and edge states $\mathbf{x}_t$ in the network without delay, observing the network state graph $S_t$ as-is. The Central observer aggregates $\mathbf{x}_t$ into an observation graph $O_t$ subject to communication delays. Local observers each aggregate $\mathbf{x}_t$ from the network, subject to delays, into their own views $O_{v,t}$. Right: Agent deployments. The Single agent can be a birdseye agent or a central agent, and for the latter, actions $\hat{a}_{v,t}$ must be communicated to routing nodes $v \in V_t$ after inference. In the Multi-agent setting, distributed routing agents work with the network state $S_t$ or central observation $O_t$ (the latter incurs delay when $O_t$ is communicated to the local agents), or locally obtained observations $O_{v,t}$.
  • Figure 4: Network topologies used in our experiments. From left to right: mini5, B4 (version used in xuTealLearningAcceleratedOptimization2023), GEANT (2001 version obtained from the Topology Zoo, knightInternetTopologyZoo2011), nx-XS (example), nx-S (example). Thicker edges denote larger link datarate, fewer dashes denote lower link latency.
  • Figure 5: Performance of baseline algorithms and our algorithm LOGGIA for various topology presets, evaluated in delay-aware deployment. Dashed lines denote the best SP baseline per evaluation. Above the zigzag line, the y-axis is enlarged for better readability. In the more realistic near-real-time routing setting, all neural routing algorithms except for our approach LOGGIA fail to outperform static shortest-path routing. Moreover, LOGGIA shows lower performance variance for most topology presets.
  • ...and 14 more figures
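The availability logic described in the Figure 2 caption can be made concrete: with a step granularity of $\tau$ ms, telemetry arriving over a path with one-way delay $d$ ms becomes usable $\lceil d / \tau \rceil$ interaction steps later. The sketch below illustrates this for the mini5 example; the exact per-node delays are hypothetical placeholders, only the bounds implied by the caption (at most 5 ms for nodes 0 and 2, at most 8 ms for nodes 3 and 4, with $\tau = 5$ ms) come from the text.

```python
import math

def availability_step(delay_ms: float, tau_ms: float) -> int:
    """Number of interaction steps after which telemetry sent over a path
    with the given one-way delay is available, for step granularity tau."""
    return math.ceil(delay_ms / tau_ms)

# Hypothetical spanning-tree delays from the central node 1 to the other
# nodes in mini5; illustrative values consistent with the caption's bounds.
delays_to_node1 = {0: 4.0, 2: 3.0, 3: 7.0, 4: 8.0}
tau = 5.0  # step granularity in ms

steps = {v: availability_step(d, tau) for v, d in delays_to_node1.items()}
# Node states x_{0,t} and x_{2,t} reach node 1 by step t+1;
# x_{3,t} and x_{4,t} only by step t+2, matching Figure 2d.
```

This is why the observation graph $O_{1,t}$ is assembled from node states of different ages: each router's view mixes fresh local telemetry with staler remote telemetry, depending on path delay relative to $\tau$.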