Semi-Gradient SARSA Routing with Theoretical Guarantee on Traffic Stability and Weight Convergence
Yidan Wu, Yu Yu, Jianan Zhang, Li Jin
TL;DR
This work tackles dynamic routing over parallel servers with unbounded state spaces by introducing a semi-gradient SARSA (SGS) algorithm that uses linear function approximation for Q-value estimation. A Lyapunov-based drift analysis combined with an ODE-based stochastic approximation framework yields a joint convergence guarantee: the traffic state remains stable and the weight vector converges almost surely to the approximate optimum $w^*$ if and only if the system is stabilizable, i.e., $\lambda<\sum_n\mu_n$ (the arrival rate is below the total service rate). The approach uses a softmax policy over a linear approximation $\hat{Q}(x,a;w)$ of the $Q$-function built from basis functions, which keeps routing decisions interpretable and avoids brittle neural networks. Simulations on a TCP-like congestion control problem show that SGS converges much faster than neural-network SARSA and achieves a substantial reduction in average cost compared with a join-the-shortest-queue baseline, with a small optimality gap. These results provide a provably stable, scalable RL framework for dynamic routing in networks and related parallel-server systems.
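For orientation, the objects named above can be written in standard textbook form. This is a sketch only: the feature map $\phi$, temperature $\iota>0$, step sizes $\alpha_k$, discount factor $\gamma$, and per-step cost $c_{k+1}$ are notational assumptions introduced here, and the negative sign in the softmax assumes that $\hat{Q}$ estimates a cost to be minimized.

$$\hat{Q}(x,a;w)=w^\top\phi(x,a),\qquad \pi_w(a\mid x)=\frac{\exp\big(-\hat{Q}(x,a;w)/\iota\big)}{\sum_{b}\exp\big(-\hat{Q}(x,b;w)/\iota\big)},$$

$$w_{k+1}=w_k+\alpha_k\Big[c_{k+1}+\gamma\,\hat{Q}(x_{k+1},a_{k+1};w_k)-\hat{Q}(x_k,a_k;w_k)\Big]\,\phi(x_k,a_k).$$

The update is "semi-gradient" because the gradient is taken only through $\hat{Q}(x_k,a_k;w_k)$, which for a linear approximator is simply $\phi(x_k,a_k)$, and not through the bootstrapped target.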
Abstract
We consider the traffic control problem of dynamic routing over parallel servers, which arises in a variety of engineering systems such as transportation and data transmission. We propose a semi-gradient, on-policy algorithm that learns an approximate optimal routing policy. The algorithm uses generic basis functions with flexible weights to approximate the value function across the unbounded state space. Consequently, the training process lacks Lipschitz continuity of the gradient, boundedness of the temporal-difference error, and an a priori guarantee of ergodicity, which are standard prerequisites in the existing literature on reinforcement learning theory. To address this, we combine a Lyapunov approach and an ordinary differential equation-based method to jointly characterize the behavior of the traffic state and the approximation weights. Our theoretical analysis proves that the training scheme guarantees traffic state stability and ensures almost sure convergence of the weights to the approximate optimum. We also demonstrate via simulations that our algorithm attains significantly faster convergence than neural network-based methods, with an insignificant approximation error.
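To make the training scheme concrete, the following is a minimal Python sketch of semi-gradient SARSA with a linear approximator and a softmax routing policy on a toy parallel-queue model. It is not the authors' implementation. The uniformized queue dynamics, the quadratic basis functions, the total-queue-length cost, the temperature, the discount factor, the step-size schedule, and every function name (`features`, `q_hat`, `softmax_policy`, `step`, `train`) are assumptions chosen for illustration.

```python
import math
import random

def features(x, a):
    """Hypothetical basis: a bias term plus squared queue lengths after routing to server a."""
    post = list(x)
    post[a] += 1
    return [1.0] + [float(q * q) for q in post]

def q_hat(w, x, a):
    """Linear approximation of the action value: w . phi(x, a)."""
    return sum(wi * fi for wi, fi in zip(w, features(x, a)))

def softmax_policy(w, x, temp=1.0):
    """Routing probabilities proportional to exp(-Q_hat / temp): lower estimated cost is preferred."""
    qs = [q_hat(w, x, a) for a in range(len(x))]
    q_min = min(qs)  # shift so every exponent is <= 0 (numerical stability)
    prefs = [math.exp(-(q - q_min) / temp) for q in qs]
    total = sum(prefs)
    return [p / total for p in prefs]

def step(x, a, lam, mu, rng):
    """One transition of an assumed uniformized chain: an arrival routed to a, or a service completion."""
    x = list(x)
    u = rng.random() * (lam + sum(mu))
    if u < lam:
        x[a] += 1  # arrival joins the chosen server
    else:
        u -= lam
        for n, m in enumerate(mu):
            if u < m:
                if x[n] > 0:
                    x[n] -= 1  # service completion at server n (no-op if idle)
                break
            u -= m
    return x

def train(lam=0.3, mu=(1.0, 0.8), gamma=0.9, steps=2000, seed=0):
    """Semi-gradient SARSA: on-policy TD update with the gradient taken only through Q_hat(x, a; w)."""
    rng = random.Random(seed)
    n = len(mu)
    w = [0.0] * (n + 1)
    x = [0] * n
    a = rng.choices(range(n), weights=softmax_policy(w, x))[0]
    for k in range(steps):
        x_next = step(x, a, lam, mu, rng)
        cost = float(sum(x_next))  # assumed cost: total number of queued jobs
        a_next = rng.choices(range(n), weights=softmax_policy(w, x_next))[0]
        td_error = cost + gamma * q_hat(w, x_next, a_next) - q_hat(w, x, a)
        alpha = 0.002 / (1.0 + k / 500.0)  # assumed diminishing step size
        phi = features(x, a)  # gradient of the linear Q_hat with respect to w
        w = [wi + alpha * td_error * fi for wi, fi in zip(w, phi)]
        x, a = x_next, a_next
    return w
```

Calling `train()` returns the learned weight list, and `softmax_policy(w, x)` then gives the routing probabilities for a queue-length vector `x`. The quadratic basis is used here because with purely linear features the difference in $\hat{Q}$ between two actions would not depend on the state, so the policy could not react to queue lengths. The default rates are deliberately lightly loaded (well inside $\lambda<\sum_n\mu_n$) so that the toy run stays numerically well-behaved despite the unbounded features.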
