Task Specific Sharpness Aware O-RAN Resource Management using Multi Agent Reinforcement Learning

Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah

TL;DR

This work tackles robust, scalable resource management for dynamic O-RAN networks by integrating Sharpness-Aware Minimization into a distributed SAC-based MARL framework (TA-SAM MARL). A TD-error variance-driven mechanism selectively applies SAM to actors and critics to promote flatter, more generalizable loss landscapes while dynamically scheduling the SAM radius $\rho$ to balance exploration and exploitation across heterogeneous slices. Empirical results show up to 22% gains in resource allocation efficiency and QoS satisfaction across eMBB, mMTC, and URLLC slices, with improved stability and reduced forgetting in non-stationary network conditions. The proposed approach demonstrates strong generalization, scalability, and resilience, making it well-suited for deployment in near-real-time O-RAN control loops, with future work focusing on latency-aware inference and policy compression for hardware accelerators.

Abstract

Next-generation networks utilize the Open Radio Access Network (O-RAN) architecture to enable dynamic resource management, facilitated by the RAN Intelligent Controller (RIC). While deep reinforcement learning (DRL) models show promise in optimizing network resources, they often struggle with robustness and generalizability in dynamic environments. This paper introduces a novel resource management approach that enhances the Soft Actor-Critic (SAC) algorithm with Sharpness-Aware Minimization (SAM) in a distributed Multi-Agent RL (MARL) framework. Our method introduces an adaptive and selective SAM mechanism, where regularization is explicitly driven by temporal-difference (TD)-error variance, ensuring that only agents facing high environmental complexity are regularized. This targeted strategy reduces unnecessary overhead, improves training stability, and enhances generalization without sacrificing learning efficiency. We further incorporate a dynamic $\rho$ scheduling scheme to refine the exploration-exploitation trade-off across agents. Experimental results show our method significantly outperforms conventional DRL approaches, yielding up to a $22\%$ improvement in resource allocation efficiency and ensuring superior QoS satisfaction across diverse O-RAN slices.
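The selective mechanism described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the fixed variance threshold, the linear $\rho$ decay, and the toy quadratic loss are all assumptions made for illustration. The idea shown is only that agents whose TD errors have high variance take a SAM step, while the rest take a plain gradient step.

```python
import numpy as np

def sam_step(theta, grad_fn, lr, rho):
    """One SAM update: perturb toward higher loss, then descend from theta."""
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # perturbation of radius rho
    g_sharp = grad_fn(theta + eps)               # gradient at the perturbed point
    return theta - lr * g_sharp

def sgd_step(theta, grad_fn, lr):
    """Plain gradient step, used for agents that are not regularized."""
    return theta - lr * grad_fn(theta)

def select_sam_agents(td_errors, threshold):
    """Boolean mask: True for agents whose TD-error variance exceeds threshold.

    The threshold rule is a hypothetical stand-in for the paper's criterion.
    """
    return np.array([np.var(e) > threshold for e in td_errors])

def scheduled_rho(step, total_steps, rho_max=0.1, rho_min=0.01):
    """Hypothetical linear decay of rho; the paper's schedule may differ."""
    frac = min(step / total_steps, 1.0)
    return rho_max + frac * (rho_min - rho_max)

def update_agents(thetas, grad_fn, td_errors, lr, rho, threshold):
    """Apply SAM only to the agents selected by TD-error variance."""
    mask = select_sam_agents(td_errors, threshold)
    new_thetas = [
        sam_step(t, grad_fn, lr, rho) if use_sam else sgd_step(t, grad_fn, lr)
        for t, use_sam in zip(thetas, mask)
    ]
    return new_thetas, mask

# Toy loss L(theta) = 0.5 * theta^T A theta, so the gradient is A @ theta.
A = np.diag([1.0, 10.0])
grad_fn = lambda theta: A @ theta

thetas = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]
td_errors = [np.array([0.1, 0.1, 0.1]),    # stable agent: low variance
             np.array([-2.0, 2.0, 0.0])]   # volatile agent: high variance
rho = scheduled_rho(step=0, total_steps=100)
new_thetas, mask = update_agents(thetas, grad_fn, td_errors,
                                 lr=0.01, rho=rho, threshold=0.5)
print(mask)
```

In this toy run only the second agent is selected for the SAM update, so only it pays the cost of the extra gradient evaluation.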

Paper Structure

This paper contains 34 sections, 12 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: System topology showing multi-agent Soft Actor-Critic (SAC) architecture for O-RAN. Each O-DU serves as a local actor, while the xApp in the near-RT RIC functions as the global critic. Solid arrows indicate action/state flow from actors to the critic; dashed feedback loops represent critic updates to the actors. The architecture interfaces with the O-CU via E1 and the near-RT RIC via E2.
  • Figure 2: Multi-agent RL (MARL) framework for O-RAN showing actor-critic interactions.
  • Figure 3: Diagram of SAM Parameter Perturbation and Update.
  • Figure 4: Overview of the TA-SAM MARL training framework. The figure distinguishes actor and critic components using dashed boxes and separates their update paths through color-coded arrows. The actor flow includes: (1) actor weight initialization, (2) experience collection and replay buffer, (3) policy gradient computation, and (4) SAM-based policy parameter update. The critic flow involves: (5) accessing replay buffer data, (6) computing the critic loss, and (7) updating the critic parameters. Then, SAM perturbs policy parameters before the final update to promote flatter minima. The flow applies to all agents, while SAM updates are selectively applied only to those identified by TD-error variance.
  • Figure 5: Average cumulative reward values under different $\rho$ selection scenarios.
  • ...and 6 more figures
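
As background for the parameter perturbation and the role of $\rho$ mentioned in the captions above, the standard SAM formulation can be written as follows. This is the generic form from the SAM literature, not necessarily the notation used in this paper; $L$ denotes a generic actor or critic loss and $\eta$ a learning rate, both assumed symbols.

```latex
\begin{align}
  \min_{\theta} \; & \max_{\lVert \epsilon \rVert_2 \le \rho} L(\theta + \epsilon), \\
  \hat{\epsilon}(\theta) &= \rho \, \frac{\nabla_{\theta} L(\theta)}{\lVert \nabla_{\theta} L(\theta) \rVert_2}, \\
  \theta &\leftarrow \theta - \eta \, \nabla_{\theta} L(\theta) \Big|_{\theta + \hat{\epsilon}(\theta)}.
\end{align}
```

A larger $\rho$ searches a wider neighborhood for the worst-case loss and pushes harder toward flat minima, which is why its value and schedule affect the reward curves compared across the $\rho$ scenarios.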