Structured Reinforcement Learning for Media Streaming at the Wireless Edge

Archana Bura, Sarat Chandra Bobbili, Shreyas Rameshkumar, Desik Rengarajan, Dileep Kalathil, Srinivas Shakkottai

TL;DR

The paper addresses dynamic resource allocation for media streaming at the wireless edge by formulating it as a constrained MDP (CMDP) and showing that a Lagrangian relaxation decouples the problem into per-client subproblems, each with an optimal policy that is a threshold on buffer length. It introduces a primal-dual natural policy gradient method with a soft-threshold parameterization to learn these threshold policies efficiently under a constraint on high-priority allocations, with a convergence guarantee at a rate of $O(1/\sqrt{T})$. The approach is validated in simulation, where decentralized, per-client learning trains roughly 4× faster than centralized methods while maintaining performance, and on a real-world WiFi testbed, where YouTube QoE improves by over 30% relative to a vanilla policy. The learned structured policy is also cheap to deploy, with policy execution taking only about 15 $\mu$s. Overall, the work advances model-driven RL for edge networking by exploiting problem structure to deliver scalable, high-QoE streaming with provable guarantees.

Abstract

Media streaming is the dominant application over wireless edge (access) networks. The increasing softwarization of such networks has led to efforts at intelligent control, wherein application-specific actions may be dynamically taken to enhance the user experience. The goal of this work is to develop and demonstrate learning-based policies for optimal decision making to determine which clients to dynamically prioritize in a video streaming setting. We formulate the policy design question as a constrained Markov decision problem (CMDP), and observe that by using a Lagrangian relaxation we can decompose it into single-client problems. Further, the optimal policy takes a threshold form in the video buffer length, which enables us to design an efficient constrained reinforcement learning (CRL) algorithm to learn it. Specifically, we show that a natural policy gradient (NPG) based algorithm that is derived using the structure of our problem converges to the globally optimal policy. We then develop a simulation environment for training, and a real-world intelligent controller attached to a WiFi access point for evaluation. We empirically show that the structured learning approach enables fast learning. Furthermore, such a structured policy can be easily deployed due to low computational complexity, leading to policy execution taking only about 15 $\mu$s. Using YouTube streaming experiments in a resource-constrained scenario, we demonstrate that the CRL approach can increase quality of experience (QoE) by over 30%.
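
To make the threshold idea concrete, the following is a minimal sketch of a soft-threshold policy on buffer length together with a projected dual update on the Lagrange multiplier. It is an illustration under stated assumptions, not the paper's implementation: the sigmoid form, the direction of the threshold (prioritize when the buffer is short), and the names `soft_threshold_prob`, `dual_step`, `theta`, `temperature`, and `budget` are all hypothetical.

```python
import math


def soft_threshold_prob(buffer_len, theta, temperature=1.0):
    """Probability of assigning a client to high-priority service.

    Assumed form: a sigmoid centered at the threshold `theta`, so the
    probability is 0.5 at buffer_len == theta and rises as the buffer
    drains. As `temperature` shrinks, this approaches a hard threshold.
    """
    return 1.0 / (1.0 + math.exp(-(theta - buffer_len) / temperature))


def dual_step(lam, avg_high_priority, budget, step_size):
    """One projected dual-ascent step on the multiplier `lam`.

    The multiplier grows when the measured high-priority usage exceeds
    the (assumed) budget, and is clipped at zero otherwise.
    """
    return max(0.0, lam + step_size * (avg_high_priority - budget))


if __name__ == "__main__":
    for b in (0, 5, 10):
        print(b, round(soft_threshold_prob(b, theta=5.0), 3))
    print(dual_step(1.0, avg_high_priority=3.0, budget=2.0, step_size=0.5))
```

Because each client's policy in this sketch is described by a single scalar threshold, evaluating it costs one exponential per decision, which is in keeping with the low execution time reported in the abstract.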

Paper Structure

This paper contains 19 sections, 14 theorems, 27 equations, 14 figures, 1 algorithm.

Key Result

Lemma 1

Suppose the Slater condition assumption holds. Then $D(\lambda^*) = \sum_{n=1}^N J_c^{\pi^*}(\rho_n)$.
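
For orientation, a generic Lagrangian dual of the kind this lemma refers to can be sketched as below. Only $D$, $\lambda^*$, $J_c$, $\pi^*$, $\rho_n$, and $N$ appear in the lemma; the constraint cost $J_k$, the budget $K$, and the minimization convention are illustrative assumptions and may differ from the paper's definitions.

$$
L(\pi, \lambda) = \sum_{n=1}^N J_c^{\pi}(\rho_n) + \lambda \Big( \sum_{n=1}^N J_k^{\pi}(\rho_n) - K \Big),
\qquad
D(\lambda) = \min_{\pi} L(\pi, \lambda),
\qquad
\lambda^* \in \arg\max_{\lambda \ge 0} D(\lambda).
$$

Under these assumptions, the lemma is a strong duality statement: the dual value at the optimal multiplier equals the total cost of the constrained-optimal policy $\pi^*$ summed over the $N$ clients. Since $L$ is a sum of per-client terms for fixed $\lambda$, the inner minimization can be carried out one client at a time.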

Figures (14)

  • Figure 1: Feedback loop in a media streaming application. The states of the YouTube sessions and channel qualities are communicated to an intelligent controller that determines the service class for each session, accounting for resource constraints. The impact of this decision on the end-user QoE (reward) is communicated back to the controller.
  • Figure 2: Training in simulation
  • Figure 3: QoE in simulation
  • Figure 4: Clients in high priority service
  • Figure 5: Inference time for DC-T and PPO algorithms for $6$ clients
  • ...and 9 more figures

Theorems & Definitions (15)

  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Corollary 1
  • Theorem 3
  • Lemma 5
  • Lemma 6
  • ...and 5 more