Differentiable Discrete Event Simulation for Queuing Network Control

Ethan Che; Jing Dong; Hongseok Namkoong

TL;DR

This work introduces PATHWISE, a differentiable discrete-event simulation framework for queueing network control that computes pathwise gradients of performance with respect to scheduling actions. By applying capacity-sharing relaxations and a straight-through smoothing approach to the non-differentiable event selection, PATHWISE enables efficient gradient-based optimization of neural policies and achieves orders-of-magnitude gains in sample efficiency over model-free baselines like REINFORCE. A work-conserving softmax policy further stabilizes training across large, non-stationary networks. Empirically, PATHWISE outperforms PPO baselines and standard queuing policies on scheduling and admission-control tasks, particularly in large-scale and high-variance settings, and theory on the M/M/1 queue explains the variance advantages over REINFORCE. Overall, the framework offers a practical, scalable method for learning in complex discrete-event systems and suggests broad applicability beyond queuing networks.

Abstract

Queuing network control is essential for managing congestion in job-processing systems such as service systems, communication networks, and manufacturing processes. Despite growing interest in applying reinforcement learning (RL) techniques, queueing network control poses distinct challenges, including high stochasticity, large state and action spaces, and lack of stability. To tackle these challenges, we propose a scalable framework for policy optimization based on differentiable discrete event simulation. Our main insight is that by implementing a well-designed smoothing technique for discrete event dynamics, we can compute pathwise policy gradients for large-scale queueing networks using auto-differentiation software (e.g., TensorFlow, PyTorch) and GPU parallelization. Through extensive empirical experiments, we observe that our policy gradient estimators are several orders of magnitude more accurate than typical REINFORCE-based estimators. In addition, we propose a new policy architecture, which drastically improves stability while maintaining the flexibility of neural-network policies. In a wide variety of scheduling and admission control tasks, we demonstrate that training control policies with pathwise gradients leads to a 50-1000x improvement in sample efficiency over state-of-the-art RL methods. Unlike prior tailored approaches to queueing, our methods can flexibly handle realistic scenarios, including systems operating in non-stationary environments and those with non-exponential interarrival/service times.
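To make the idea of smoothing the event selection concrete, the sketch below contrasts a hard (one-hot) choice of the next event with a softmin relaxation. It is a minimal illustration rather than the paper's implementation: the function names, the use of a softmin, and the reading of $\beta$ as an inverse temperature are assumptions made here. In a straight-through scheme, the hard one-hot vector would drive the forward simulation while the softmin Jacobian stands in for its (otherwise zero) derivative during backpropagation.

```python
import numpy as np

def hard_event(residual_times):
    """One-hot indicator of the event with the smallest residual time."""
    t = np.asarray(residual_times, dtype=float)
    e = np.zeros_like(t)
    e[np.argmin(t)] = 1.0
    return e

def soft_event(residual_times, beta):
    """Softmin relaxation (assumed form): weights proportional to exp(-beta * t)."""
    t = np.asarray(residual_times, dtype=float)
    z = -beta * (t - t.min())  # shift by the minimum for numerical stability
    w = np.exp(z)
    return w / w.sum()

def soft_event_jacobian(residual_times, beta):
    """Jacobian of soft_event with respect to the residual times.

    Entry (i, j) is d p_i / d t_j = -beta * p_i * (delta_ij - p_j).
    A straight-through estimator would use this as a surrogate gradient
    for the hard one-hot selection.
    """
    p = soft_event(residual_times, beta)
    return -beta * (np.diag(p) - np.outer(p, p))

# Hypothetical residual times for three candidate events.
residual = np.array([0.7, 0.2, 1.5])
e_hard = hard_event(residual)               # selects the second event
e_soft = soft_event(residual, beta=5.0)     # smooth weights that sum to one
J = soft_event_jacobian(residual, beta=5.0)
```

A larger `beta` brings the soft weights closer to the one-hot vector but makes the surrogate gradient more sharply peaked, so the temperature trades bias against variance.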

Paper Structure (32 sections, 4 theorems, 104 equations, 14 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Scheduling in Queuing Networks
Reinforcement Learning in Queueing Network Control
Differentiable Simulation in RL and Operations Research
Gradient Estimation in Machine Learning
Gradient Estimation in Operations Research
Discrete-Event Dynamical System Model for Queuing Networks
The Scheduling Problem
System Description
Queuing Network Examples
Gradient Estimation
The standard approach: the $\mathsf{REINFORCE}$ estimator
Our approach: Differentiable Discrete-Event Simulation
Capacity sharing relaxation
...and 17 more sections

Key Result

Theorem 1

Let $\widehat{\nabla}_\mu (x_{k+1} - x_{k}) = \widehat{\nabla}_\mu De_{k+1}$ denote the $\mathsf{PATHWISE}$ gradient estimator of the one-step transition of the $M/M/1$ queue with respect to $\mu$. For $x_k\geq 1$, as $\beta \to \infty$,

Figures (14)

Figure 1: Improvements in sample efficiency of our proposed $\mathsf{PATHWISE}$ policy gradient estimator over a standard model-free RL estimator, $\mathsf{REINFORCE}$. (Left) Samples of policy gradient estimators for a parameterized MaxPressure policy in a criss-cross network with traffic intensity $\rho=0.9$ (see Example \ref{example:criss-cross}). Each draw of the $\mathsf{REINFORCE}$ estimator is averaged over $B = 10^3$ trajectories and is equipped with a value function baseline, which is fitted using $10^6$ state transitions. The $\mathsf{PATHWISE}$ estimator uses only a single trajectory and no value function. Despite using less data, it is more closely aligned with the true gradient. (Right) Average cosine similarity (higher is better) of policy gradient estimators with the true policy gradient (see \ref{eqn:similarity} for more details) across different levels of traffic intensity for the criss-cross network. For $\mathsf{REINFORCE}$, we plot the cosine similarity of the estimator under different batch sizes $B = 1, \ldots, 10^{4}$. We see that the efficiency advantages of $\mathsf{PATHWISE}$, with only 1 trajectory, are greater under higher traffic intensities, even outperforming $\mathsf{REINFORCE}$ with a value function baseline and $B=10^{4}$ trajectories.
Figure 2: (Left) Pseudo-code of a single gradient step of our proposed $\mathsf{PATHWISE}$ estimator. Computing the estimator requires only a few lines of code to compute the cost incurred by the policy. Once this cost is calculated, the sample path gradient is computed automatically via reverse-mode auto-differentiation. Unlike standard methods such as infinitesimal perturbation analysis or likelihood-ratio estimation, we can apply the same code to any network without bespoke modifications. Unlike model-free gradient estimators like $\mathsf{REINFORCE}$, our method does not require a separate value function fitting step, a replay buffer, feature/return normalization, generalized advantage estimation, etc., because it has low variance without any modification. (Right) A sample path of the total queue length (light blue) for a multi-class queuing network (see Example \ref{example:multiclass}) under a randomized priority scheduling policy. Along the path, we display the gradients (dark blue), computed using our framework, of the average cost with respect to each action produced by the policy: $\nabla_{u_{k}} \frac{1}{N}\sum_{k=0}^{N-1} c(x_{k},u_{k})\tau^{*}_{k+1}$.
Figure 3: One step of the dynamics for the criss-cross network (see Example \ref{example:criss-cross} and Figure \ref{fig:networks}). There are 3 queues and 2 servers. Beginning with queue-lengths $x_{k} = (3, 1, 4)$ and workloads $w_{k}$, the action $u_{k}$ assigns server 1 to queue 3 and server 2 to queue 2. The workloads of the selected queues are highlighted in light green. As a result, the valid events are arrivals to queue 1 and queue 3 (queue 2 has no external arrivals) and job completions for queue 2 and queue 3 (queue 1 cannot experience any job completions because no server is assigned). The arrival event to queue 1 has the minimum residual time (highlighted in red) so it is the next event, and $e_{k+1}$ is a one-hot vector indicating this. Since an arrival occurred, the queue-lengths are updated as $x_{k+1} = (4, 1, 4)$.
Figure 4: $M/M/1$ queue.
Figure 5: Multi-class, single-server queue.
...and 9 more figures
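
The one-step dynamics described in the Figure 3 caption can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper: the event ordering, the event matrix `D`, the routing of completed queue-1 jobs into queue 2 (the usual criss-cross topology), and the residual-time values are all assumptions chosen so that the arrival to queue 1 fires first, as in the caption. Invalid events are given an infinite residual time.

```python
import numpy as np

# Assumed event order: arrivals to queues 1-3, then job completions at queues 1-3.
# Column j of D is the change in queue lengths caused by event j.
# A completion at queue 1 is assumed to route the job to queue 2.
D = np.array([
    [1, 0, 0, -1,  0,  0],
    [0, 1, 0,  1, -1,  0],
    [0, 0, 1,  0,  0, -1],
])

def step(x, residual_times):
    """Advance the queue lengths by one event.

    Returns the new queue lengths, the one-hot event indicator e,
    and the time elapsed until that event.
    """
    t = np.asarray(residual_times, dtype=float)
    j = int(np.argmin(t))            # next event has the minimum residual time
    e = np.zeros(t.shape[0])
    e[j] = 1.0
    tau = t[j]
    x_next = x + (D @ e).astype(int)  # x_{k+1} = x_k + D e_{k+1}
    return x_next, e, tau

# State from the caption: x_k = (3, 1, 4).
x_k = np.array([3, 1, 4])
# Hypothetical residual times. Queue 2 has no external arrivals and queue 1
# has no server assigned, so those two events are invalid (infinite time).
residual = [0.3, np.inf, 0.9, np.inf, 0.5, 0.6]

x_next, e_next, tau = step(x_k, residual)
print(x_next)  # [4 1 4]
```

The `argmin` in `step` is the non-differentiable operation: the resulting one-hot vector is piecewise constant in the residual times, which is why a smoothed surrogate is needed before auto-differentiation can yield useful gradients.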

Theorems & Definitions (5)

Definition 1
Theorem 1
Corollary 1
Theorem 2
Corollary 2
