Optimal Flow Admission Control in Edge Computing via Safe Reinforcement Learning

A. Fox; F. De Pellegrini; F. Faticanti; E. Altman; F. Bronzino

TL;DR

This work tackles admission control of heterogeneous information flows in edge computing by formulating the problem as a constrained Markov decision process (CMDP) that accounts for edge compute and access-network capacities. It introduces DR-CPO, a safe reinforcement learning algorithm that uses reward decomposition and Lagrangian relaxation to learn a decentralized optimal admission policy, with provable structural properties and convergence guarantees. Compared with a general-purpose DRL baseline, DR-CPO delivers up to 15% higher long-term reward and converges in roughly half the learning episodes across diverse environments, while mitigating state-space explosion. The authors also couple the learned admission policy with a two-stage load-balancing scheme to further enhance system performance and resource utilization in multi-server edge settings. The approach provides a scalable, provably safe framework for flow-aware edge analytics and points to future work on joint routing and content-aware admissions.
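For orientation, a generic discounted CMDP and its Lagrangian relaxation can be written as below. This is the standard textbook form, not necessarily the paper's exact formulation: the symbols $J_r$, $J_c^i$, $\lambda_i$, $\gamma$ and the index set $\mathcal{I}$ are assumed here for illustration, while $d^i$ follows the constraint bounds named in the figure captions.

```latex
% Generic discounted CMDP (assumed notation):
\begin{align}
  \max_{\pi} \quad & J_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\Big] \\
  \text{s.t.} \quad & J_c^{i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t}\, c^{i}(s_t, a_t)\Big] \le d^{i},
  \qquad i \in \mathcal{I}.
\end{align}
% Lagrangian relaxation with one multiplier per constraint:
\begin{equation}
  \min_{\lambda \ge 0} \, \max_{\pi} \; L(\pi, \lambda)
  = J_r(\pi) - \sum_{i \in \mathcal{I}} \lambda_i \big( J_c^{i}(\pi) - d^{i} \big).
\end{equation}
```

For a fixed $\lambda$, the inner maximization is an unconstrained MDP with reward $r - \sum_i \lambda_i c^i$, which is what makes primal-dual learning schemes applicable.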

Abstract

With the uptake of intelligent data-driven applications, edge computing infrastructures require a new generation of admission control algorithms to maximize system performance under limited and highly heterogeneous resources. In this paper, we study how to optimally select information flows belonging to different classes and dispatch them to multiple edge servers, where deployed applications perform flow analytics tasks. The optimal policy is obtained via constrained Markov decision process (CMDP) theory, accounting for the demand of each edge application for specific classes of flows and for the capacity constraints of the edge servers and of the access network. We develop DR-CPO, a specialized primal-dual Safe Reinforcement Learning (SRL) method that solves the resulting optimal admission control problem by reward decomposition. DR-CPO performs optimal decentralized control and effectively mitigates state-space explosion while preserving optimality. Compared to existing Deep Reinforcement Learning (DRL) solutions, extensive results show that DR-CPO achieves 15% higher reward on a wide variety of environments, while requiring on average only 50% of the learning episodes to converge. Finally, we show how to combine DR-CPO with load balancing to optimally dispatch information streams to the available edge servers and further improve system performance.
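To make the primal-dual idea concrete, the following is a minimal sketch of projected gradient ascent on a decision variable and projected gradient ascent on a Lagrange multiplier, applied to a toy one-dimensional admission problem. It is not DR-CPO: the concave reward `p * (2 - p)`, the linear cost `p`, the budget of 0.4, the step sizes, and the function name `primal_dual` are all hypothetical choices made for illustration.

```python
def primal_dual(budget=0.4, alpha=0.1, beta=0.05, steps=2000):
    """Toy primal-dual iteration (illustrative assumption, not the paper's algorithm).

    Maximize reward(p) = p * (2 - p) over an admission probability p in [0, 1],
    subject to cost(p) = p <= budget.
    Lagrangian: L(p, lam) = reward(p) - lam * (cost(p) - budget).
    """
    p, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on L with respect to p, projected onto [0, 1].
        grad_p = (2.0 - 2.0 * p) - lam
        p = min(1.0, max(0.0, p + alpha * grad_p))
        # Dual step: raise the multiplier while the constraint is violated,
        # lower it otherwise, and keep it non-negative.
        lam = max(0.0, lam + beta * (p - budget))
    return p, lam


p_star, lam_star = primal_dual()
print(f"admission probability ~ {p_star:.3f}, multiplier ~ {lam_star:.3f}")
```

In this toy problem the unconstrained optimum is `p = 1`, so the constraint is active and the iteration settles at `p = budget` with a strictly positive multiplier. The multiplier acts as a price on the constrained resource, which is the role Lagrangian relaxation plays in primal-dual safe RL methods.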

Paper Structure (30 sections, 5 theorems, 27 equations, 5 figures, 2 tables, 2 algorithms)

Introduction
System Model
System state.
Action set.
Probability kernel.
Rewards.
Policy.
The CMDP model
Learning the optimal admission policy
Reward and cost decomposition.
Lagrangian relaxation methods for safe reinforcement learning.
Decomposed Reward Constrained Policy Optimization.
Convergence of DR-CPO.
Load balancing
Numerical results
...and 15 more sections

Key Result

Theorem 1

If the MVAC problem is feasible, then: (i) there exists an optimal stationary policy $\pi$ which is randomized in at most $M$ states; (ii) such a policy is a deterministic stationary policy if no constraint is active; (iii) when at least one constraint is active, within the optimal stationary policy ...

Figures (5)

Figure 1: (a) Camera arrival and departure: a camera arrives in an area, transmits its flow towards a tagged server (boxed index), then departs; (b) System state: using the notation in Tab. \ref{tab:notation}, $M = 4$; $\mathcal{D}^1 = \{ A, B \}$, $\mathcal{D}^2 = \{ A, C \}$, $\mathcal{D}^3 = \{ B \}$, $\mathcal{D}^4 = \{ A \}$; $X^1 = (2, 1, 0, 0)$, $X^2 = (0, 2, 0, 0)$, $X^3 = (1, 0, 2, 0)$, $X^4 = (0, 0, 0, 2)$; $j = 1$, $i = 1$; $Y^1 = 3$, $Y^2 = 2$, $Y^3 = 3$, $Y^4 = 2$.
Figure 2: Learning dynamics for (a) the discounted reward and (b) the discounted cost function.
Figure 3: Optimal reward distribution as the number of applications per server increases.
Figure 4: The joint reward and learning dynamics for various values of $d^i$; panels (a), (b), (c), (d), and (e) show the discounted cost dynamics. Dashed line: median. Upper and lower borders of the shaded regions: server with the highest and lowest associated cost, respectively.
Figure 5: Performance of different load balancing policies; values on the y-axis are normalized w.r.t. naive uniform load balancing: (a) increasing ratio $\psi / \theta$ per server and (b) increasing number of areas $M$.

Theorems & Definitions (10)

Theorem 1
proof
Proposition 1
Proposition 2
proof
Lemma 1
proof
proof
Proposition 3
proof
