Multi-Agent Reinforcement Learning for Task Offloading in Wireless Edge Networks

Andrea Fox, Francesco De Pellegrini, Eitan Altman

TL;DR

This paper tackles scalable, decentralized task offloading in wireless edge networks by formulating each device as an independent constrained MDP (CMDP) and coordinating all agents through infrequently updated shared constraints. The proposed Decentralized Coordination via CMDPs (DCC) framework uses a three-timescale learning scheme: fast local policy optimization under a decomposed, approximate reward, intermediate Lagrange multiplier updates to enforce long-term constraints, and slow, stochastic optimization of the constraint vector to align with global objectives. The authors provide a theoretical bound on the reward approximation, differentiability results, and gradient-simplification techniques, and validate the approach on toy edge-offloading scenarios where DCC-QL outperforms independent Q-learning and competitive CTDE baselines, especially as system size grows. The work demonstrates that lightweight, constraint-driven coordination can yield scalable, communication-efficient performance improvements in congestible wireless edge environments, with clear directions for extending to asynchronous updates and broader empirical validation.
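The three-timescale structure described above can be sketched on a toy problem. This is only an illustration under loud assumptions, not the authors' DCC-QL algorithm: each agent's policy is collapsed to a single offloading frequency `p`, the local reward `benefit * p - 0.5 * p**2` and the congestion penalty `congestion * sum(p)**alpha` are hypothetical, and the slow constraint update uses a simultaneous-perturbation (SPSA-style) finite-difference estimate chosen here for brevity. All function and parameter names are made up for the example.

```python
import numpy as np

def solve_local_cmdps(theta, benefit, inner_steps=20, outer_steps=30,
                      lr_policy=0.5, lr_lambda=0.5):
    """Fast and intermediate timescales for a fixed constraint vector theta.

    Each agent maximizes its (assumed) local reward subject to p <= theta
    through a Lagrangian: gradient ascent on p, dual ascent on lam.
    """
    p = np.zeros_like(theta)
    lam = np.zeros_like(theta)
    for _ in range(outer_steps):              # intermediate: multipliers
        for _ in range(inner_steps):          # fast: local policy
            grad = (benefit - p) - lam        # d/dp of reward - lam * (p - theta)
            p = np.clip(p + lr_policy * grad, 0.0, 1.0)
        lam = np.maximum(0.0, lam + lr_lambda * (p - theta))
    return p, lam

def global_objective(p, benefit, congestion=0.5, alpha=2.0):
    """Assumed global reward: local rewards minus a shared congestion penalty."""
    return np.sum(benefit * p - 0.5 * p**2) - congestion * np.sum(p) ** alpha

def dcc_like_loop(benefit, slow_steps=40, lr_theta=0.05, perturb=0.05, seed=0):
    """Slow timescale: stochastic ascent on theta using the global objective."""
    rng = np.random.default_rng(seed)
    theta = np.full(benefit.shape, 0.5)
    for _ in range(slow_steps):
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        p_plus, _ = solve_local_cmdps(np.clip(theta + perturb * delta, 0, 1), benefit)
        p_minus, _ = solve_local_cmdps(np.clip(theta - perturb * delta, 0, 1), benefit)
        diff = global_objective(p_plus, benefit) - global_objective(p_minus, benefit)
        grad_est = diff / (2.0 * perturb) * delta   # for +/-1 entries, 1/delta == delta
        theta = np.clip(theta + lr_theta * grad_est, 0.0, 1.0)
    p, lam = solve_local_cmdps(theta, benefit)
    return theta, p, lam
```

Calling `dcc_like_loop(np.array([1.0, 0.8, 0.6, 0.9]))` returns the learned constraint vector together with the constrained offloading frequencies and multipliers. In this toy setting the agents only ever see their own `theta` entry, which mirrors the idea that coordination happens through the constraints instead of through direct communication.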

Abstract

In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resort to centralized critics or frequent communication, which fail under limited observability and communication constraints. We propose a decentralized framework in which each agent solves a constrained Markov decision process (CMDP), coordinating implicitly through a shared constraint vector. In task offloading, for example, the constraints prevent overloading of shared server resources. Coordination constraints are updated infrequently and act as a lightweight coordination mechanism. They enable agents to align with global resource usage objectives while requiring little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. We establish theoretical guarantees under mild assumptions and validate our approach experimentally, showing improved performance over centralized and independent baselines, especially in large-scale settings.
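For readers unfamiliar with CMDPs, a generic textbook form of the per-agent problem and its Lagrangian relaxation may help. The notation below is an assumption made for illustration (a discounted criterion, local reward $r_i$, resource-usage cost $c_i$, and $\theta_i$ as agent $i$'s entry of the shared constraint vector); the paper's exact formulation may differ.

```latex
% Generic constrained MDP for agent i (assumed notation, not taken from the paper)
\begin{aligned}
\max_{\pi_i}\quad & J_i^{r}(\pi_i) = \mathbb{E}_{\pi_i}\Big[\sum_{t \ge 0} \gamma^{t}\, r_i(s_t, a_t)\Big] \\
\text{s.t.}\quad & J_i^{c}(\pi_i) = \mathbb{E}_{\pi_i}\Big[\sum_{t \ge 0} \gamma^{t}\, c_i(s_t, a_t)\Big] \le \theta_i
\end{aligned}

% Lagrangian relaxation with multiplier \lambda_i \ge 0
\mathcal{L}_i(\pi_i, \lambda_i; \theta_i)
  = J_i^{r}(\pi_i) - \lambda_i \big( J_i^{c}(\pi_i) - \theta_i \big),
\qquad
\max_{\pi_i} \min_{\lambda_i \ge 0} \mathcal{L}_i(\pi_i, \lambda_i; \theta_i)
```

In this form the policy is optimized for a fixed multiplier, the multiplier rises while the constraint is violated, and $\theta_i$ is the only quantity an agent needs from the rest of the system.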

Paper Structure

This paper contains 52 sections, 7 theorems, 44 equations, 13 figures, 2 tables, 3 algorithms.

Key Result

Lemma 1

Given a global policy $\pi$ and a vector $\theta \in \mathbb{R}^N$ such that [...], where $N_{i}(a)$ denotes the random variable representing the frequency with which agent $i$ selects the crowded action at time $t$, it is verified that, for a nonlinear penalty function $d$, [...]. Moreover, if $d$ is linear, the two reward values coincide and the approximation becomes exact.
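One way to build intuition for why linearity of $d$ matters is the gap between the expected penalty and the penalty at the expected load. The sketch below assumes, purely for illustration, that agents choose the crowded action independently with known probabilities and that the penalty is $d(x) = x^\alpha$; neither assumption is taken from the lemma, and the function names are hypothetical.

```python
import numpy as np

def load_distribution(offload_probs):
    """PMF of the number of agents on the crowded action (independence assumed)."""
    pmf = np.array([1.0])
    for q in offload_probs:
        # Convolve with the Bernoulli(q) PMF of one more agent.
        pmf = np.convolve(pmf, [1.0 - q, q])
    return pmf

def penalty_gap(offload_probs, alpha):
    """E[d(X)] - d(E[X]) for the hypothetical penalty d(x) = x**alpha."""
    pmf = load_distribution(offload_probs)
    loads = np.arange(len(pmf), dtype=float)
    expected_penalty = np.sum(pmf * loads**alpha)
    penalty_at_mean = np.sum(pmf * loads) ** alpha
    return expected_penalty - penalty_at_mean
```

With `alpha = 1` the gap is zero by linearity of expectation, while with `alpha = 2` it equals the variance of the load, so any approximation that swaps the two quantities is exact only in the linear case.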

Figures (13)

  • Figure 1: We compare the final normalized reward after 5 iterations of DCC-QL when starting from a naive constraint ($\theta = 0$) and when starting from an optimized value. The optimized value is chosen from the final values obtained by DCC-QL when starting from naive initial constraints.
  • Figure 2: Comparison of the evolution of the reward when starting from optimized initial constraints in settings with different numbers of devices.
  • Figure 3: Evolution of the offloading action frequency for different values of the penalty exponent $\alpha$. Increasing $\alpha$ leads to a consistent decrease in offloading frequency across algorithms, reflecting the stronger penalization of simultaneous offloading.
  • Figure 4: In these figures we evaluate an approximation of the gradient of $\tilde{J}_i^\ell(\theta)$ using the finite difference method. We considered a noise $\epsilon \in (0.01, 0.25)$ in both cases. In the left figure, representing the local component of the noise, we expected negative values, while in the right figure we expected positive values.
  • Figure 5: In these figures we evaluate an exact gradient of $\tilde{J}_i^\ell(\theta)$ using the finite difference method with a very small noise $\epsilon \in \{ -0.00001, 0.00001 \}$.
  • ...and 8 more figures
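The finite difference method mentioned in the captions for Figures 4 and 5 can be sketched as follows. This is a generic central-difference estimator, not the authors' evaluation code: the objective `f` stands in for $\tilde{J}_i^\ell(\theta)$, and the central form (which averages the $+\epsilon$ and $-\epsilon$ one-sided differences) is an assumption.

```python
import numpy as np

def finite_difference_grad(f, theta, eps=1e-5):
    """Central-difference estimate of the gradient of f at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps                      # perturb one coordinate at a time
        grad[k] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad
```

A larger `eps` smooths over sampling noise when `f` is itself estimated from rollouts, at the cost of bias; a very small `eps` approaches the true derivative only when `f` can be evaluated almost exactly.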

Theorems & Definitions (16)

  • Lemma 1
  • proof
  • Proposition 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof
  • Proposition 2
  • proof
  • ...and 6 more