Structured Reinforcement Learning for Media Streaming at the Wireless Edge
Archana Bura, Sarat Chandra Bobbili, Shreyas Rameshkumar, Desik Rengarajan, Dileep Kalathil, Srinivas Shakkottai
TL;DR
The paper addresses dynamic resource allocation for media streaming at the wireless edge by formulating it as a constrained MDP (CMDP) and showing that a Lagrangian relaxation decouples the problem into per-client subproblems, each of which admits an optimal threshold policy in the buffer length. It introduces a primal-dual natural policy gradient method with a soft-threshold parameterization to learn these threshold policies efficiently under a constraint on high-priority allocations, with a convergence guarantee at a rate of $O(1/\sqrt{T})$. The approach is validated in simulation, where decentralized, per-client learning trains roughly 4× faster than centralized methods while maintaining performance, and on a real-world WiFi testbed, where YouTube QoE improves by over 30% relative to a vanilla policy. The structured policy is also cheap to deploy, with policy execution taking only about 15 $\mu$s. Overall, the work advances model-driven RL for edge networking by exploiting problem structure to deliver scalable, high-QoE streaming with provable guarantees and real-world viability.
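To make the decoupling step concrete, here is a generic sketch of how a Lagrangian relaxation separates a sum-structured CMDP. The notation is an assumption for illustration and is not taken from the paper: $N$ clients, per-client reward and cost values $V_r^{i}$ and $V_c^{i}$, a budget $K$ on high-priority allocations, a multiplier $\lambda \ge 0$, and a joint policy $\pi = (\pi_1, \dots, \pi_N)$ whose components act on independently evolving client states.

$$
L(\pi, \lambda) = \sum_{i=1}^{N} V_r^{i}(\pi_i) - \lambda \Big( \sum_{i=1}^{N} V_c^{i}(\pi_i) - K \Big) = \sum_{i=1}^{N} \big( V_r^{i}(\pi_i) - \lambda V_c^{i}(\pi_i) \big) + \lambda K
$$

Under these assumptions, for a fixed $\lambda$ each term in the final sum depends only on client $i$, so maximizing over $\pi$ reduces to $N$ independent single-client problems in which $\lambda$ acts as a price on high-priority service.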
Abstract
Media streaming is the dominant application over wireless edge (access) networks. The increasing softwarization of such networks has led to efforts at intelligent control, wherein application-specific actions may be dynamically taken to enhance the user experience. The goal of this work is to develop and demonstrate learning-based policies for optimal decision making to determine which clients to dynamically prioritize in a video streaming setting. We formulate the policy design question as a constrained Markov decision problem (CMDP), and observe that by using a Lagrangian relaxation we can decompose it into single-client problems. Further, the optimal policy takes a threshold form in the video buffer length, which enables us to design an efficient constrained reinforcement learning (CRL) algorithm to learn it. Specifically, we show that a natural policy gradient (NPG)-based algorithm that is derived using the structure of our problem converges to the globally optimal policy. We then develop a simulation environment for training, and a real-world intelligent controller attached to a WiFi access point for evaluation. We empirically show that the structured learning approach enables fast learning. Furthermore, such a structured policy can be easily deployed due to its low computational complexity, with policy execution taking only about 15 $\mu$s. Using YouTube streaming experiments in a resource-constrained scenario, we demonstrate that the CRL approach can increase quality of experience (QoE) by over 30%.
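The threshold structure and the constrained learning loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sigmoid form of the soft threshold, the assumption that high priority is favored when the buffer is below the threshold, and the names `soft_threshold_prob`, `dual_step`, `temperature`, and `budget` are all hypothetical choices made for illustration.

```python
import math


def soft_threshold_prob(buffer_len: float, threshold: float,
                        temperature: float = 1.0) -> float:
    """Probability of choosing the high-priority action.

    Assumed form: a sigmoid in (threshold - buffer_len), so the probability
    approaches 1 when the buffer is well below the threshold and 0 when it
    is well above. A smaller temperature gives a sharper threshold.
    """
    z = (threshold - buffer_len) / temperature
    # Numerically stable sigmoid that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dual_step(lam: float, avg_high_priority: float, budget: float,
              step_size: float) -> float:
    """One projected dual-ascent step on the Lagrange multiplier.

    The multiplier rises when measured high-priority usage exceeds the
    budget and falls otherwise; it is clipped at zero.
    """
    return max(0.0, lam + step_size * (avg_high_priority - budget))


if __name__ == "__main__":
    for b in (0.0, 5.0, 10.0):
        print(b, round(soft_threshold_prob(b, threshold=5.0), 4))
    print(dual_step(0.5, avg_high_priority=3.0, budget=2.0, step_size=0.1))
```

Because the policy in this sketch is described by a single scalar threshold per client, evaluating it costs one subtraction, one division, and one exponential, which is consistent in spirit with the low execution cost the abstract reports for structured policies.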
