Differentiable Discrete Event Simulation for Queuing Network Control
Ethan Che, Jing Dong, Hongseok Namkoong
TL;DR
This work introduces PATHWISE, a differentiable discrete-event simulation framework for queueing network control that computes pathwise gradients of performance with respect to scheduling actions. PATHWISE relaxes scheduling decisions through capacity sharing and smooths the non-differentiable selection of the next event with a straight-through estimator. Together, these enable efficient gradient-based optimization of neural policies and achieve orders-of-magnitude gains in sample efficiency over model-free baselines such as REINFORCE. A work-conserving softmax policy further stabilizes training on large, non-stationary networks. Empirically, PATHWISE outperforms PPO baselines and standard queueing policies on scheduling and admission-control tasks, particularly in large-scale and high-variance settings. An analysis of the M/M/1 queue explains its variance advantage over REINFORCE. Overall, the framework offers a practical, scalable method for learning in complex discrete-event systems and suggests broad applicability beyond queueing networks.
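The two mechanisms named in the TL;DR can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the work-conserving softmax masks out empty queues before normalizing, so a server's capacity is split only among queues that hold jobs; that fractional split is the capacity-sharing relaxation of a hard "pick one queue" decision. It also assumes the straight-through estimator uses the hard one-hot of the earliest event clock in the forward pass and the Jacobian of a temperature-scaled softmin in the backward pass. The names `work_conserving_softmax`, `softmin`, and `straight_through_event_selection`, the `temperature` parameter, and the service rates are all hypothetical.

```python
import numpy as np

# Hypothetical helpers illustrating the two mechanisms; names and the
# temperature parameter are assumptions, not the paper's code.


def work_conserving_softmax(logits, queue_lengths):
    """Fraction of one server's capacity assigned to each queue.

    Empty queues are masked out before normalizing, so whenever any queue
    holds a job the fractions sum to 1 and no capacity idles; if every queue
    is empty, all fractions are 0.
    """
    logits = np.asarray(logits, dtype=float)
    nonempty = np.asarray(queue_lengths) > 0
    if not nonempty.any():
        return np.zeros_like(logits)
    masked = np.where(nonempty, logits, -np.inf)
    z = np.exp(masked - masked[nonempty].max())  # exp(-inf) = 0 for empty queues
    return z / z.sum()


def softmin(residual_times, temperature):
    """Smooth surrogate for the one-hot indicator of the earliest event clock."""
    z = -np.asarray(residual_times, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()


def straight_through_event_selection(residual_times, temperature):
    """Forward value: hard one-hot of the next event (argmin of the clocks).

    Backward surrogate: Jacobian of softmin with respect to the clocks,
    jac[i, j] = d soft_i / d t_j = -soft_i * (delta_ij - soft_j) / temperature.
    """
    t = np.asarray(residual_times, dtype=float)
    hard = np.zeros_like(t)
    hard[np.argmin(t)] = 1.0
    soft = softmin(t, temperature)
    jac = -(np.diag(soft) - np.outer(soft, soft)) / temperature
    return hard, jac


# Capacity sharing: queue i is served at rate mu[i] * share[i] instead of
# the server picking a single queue (rates below are made up).
mu = np.array([1.0, 2.0, 1.5])
shares = work_conserving_softmax([0.3, -1.0, 0.8], queue_lengths=[4, 0, 2])
effective_rates = mu * shares
```

In an autodiff framework, the same straight-through trick is commonly written as `soft + stop_gradient(hard - soft)` (for example with `jax.lax.stop_gradient`, or `.detach()` in PyTorch). This expression evaluates to `hard` in the forward pass but differentiates through `soft`.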
Abstract
Queueing network control is essential for managing congestion in job-processing systems such as service systems, communication networks, and manufacturing processes. Despite growing interest in applying reinforcement learning (RL) techniques, queueing network control poses distinct challenges, including high stochasticity, large state and action spaces, and lack of stability. To tackle these challenges, we propose a scalable framework for policy optimization based on differentiable discrete event simulation. Our main insight is that, with a well-designed smoothing technique for discrete event dynamics, we can compute pathwise policy gradients for large-scale queueing networks using auto-differentiation software (e.g., TensorFlow, PyTorch) and GPU parallelization. Through extensive empirical experiments, we observe that our policy gradient estimators are several orders of magnitude more accurate than typical REINFORCE-based estimators. In addition, we propose a new policy architecture that drastically improves stability while maintaining the flexibility of neural-network policies. In a wide variety of scheduling and admission control tasks, we demonstrate that training control policies with pathwise gradients yields a 50-1000x improvement in sample efficiency over state-of-the-art RL methods. Unlike prior approaches tailored to queueing, our methods can flexibly handle realistic scenarios, including systems operating in non-stationary environments and those with non-exponential interarrival and service times.
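The pathwise-versus-REINFORCE comparison can be made concrete on a single FIFO M/M/1 queue simulated with the Lindley recursion. This is a simplified stand-in, not the paper's method. It applies classical infinitesimal perturbation analysis, a pathwise derivative that needs no smoothing for this single-server model, and it differentiates total waiting time with respect to a hypothetical service-time scale `s` rather than policy parameters. Both estimators target d/ds E[Σ_k W_k]. The REINFORCE (score-function) estimator multiplies the path cost by the score of the exponential service times. The function names, arrival rate 1, load 0.8, horizon, and path count are illustrative assumptions.

```python
import numpy as np

# Illustrative toy model, not the paper's simulator: a FIFO single-server
# queue whose service times are scaled by a hypothetical parameter s.


def waits_and_pathwise_grad(s, interarrivals, base_services):
    """Lindley recursion W_{k+1} = max(W_k + S_k - A_k, 0) with S_k = s * base_services[k].

    interarrivals[k] is the time between arrivals k and k + 1. Returns the
    waiting times W_k and their pathwise derivatives dW_k/ds computed on the
    same sample path (infinitesimal perturbation analysis).
    """
    n = len(base_services)
    waits = np.zeros(n)
    grads = np.zeros(n)
    for k in range(n - 1):
        x = waits[k] + s * base_services[k] - interarrivals[k]
        if x > 0:
            # Customer k + 1 waits: differentiate the active branch of max(x, 0).
            waits[k + 1] = x
            grads[k + 1] = grads[k] + base_services[k]
        # Otherwise the queue empties, so W_{k+1} = 0 and dW_{k+1}/ds = 0.
    return waits, grads


def gradient_estimates(s, n_customers, n_paths, seed=0):
    """Per-path estimates of d/ds E[sum_k W_k] for an M/M/1 queue with arrival rate 1."""
    rng = np.random.default_rng(seed)
    pathwise = np.empty(n_paths)
    reinforce = np.empty(n_paths)
    for p in range(n_paths):
        interarrivals = rng.exponential(1.0, size=n_customers - 1)
        base = rng.exponential(1.0, size=n_customers)  # S_k = s * base[k] ~ Exp(mean s)
        waits, grads = waits_and_pathwise_grad(s, interarrivals, base)
        pathwise[p] = grads.sum()
        # Score of Exp(mean s) at S_k: d/ds log p(S_k; s) = (base[k] - 1) / s.
        reinforce[p] = waits.sum() * np.sum(base - 1.0) / s
    return pathwise, reinforce


pw, rf = gradient_estimates(s=0.8, n_customers=100, n_paths=500)
print(f"pathwise:  mean {pw.mean():.1f}, std {pw.std():.1f}")
print(f"REINFORCE: mean {rf.mean():.1f}, std {rf.std():.1f}")
```

Both estimators are unbiased in this toy model. The REINFORCE estimate, however, multiplies a large path cost by a sum of 100 noisy score terms, so its per-path spread is typically far larger than the pathwise one. This is a small-scale analogue of the accuracy gap described above.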
