Risk-Aware Continuous Control with Neural Contextual Bandits

Jose A. Ayala-Romero, Andres Garcia-Saavedra, Xavier Costa-Perez

TL;DR

This work tackles risk-aware decision-making in constrained contextual bandits with continuous actions by introducing an actor-multi-critic framework in which each critic models the distribution of a metric (reward and constraints) using distributional methods based on quantile regression. A deterministic actor, guided by an aggregated reward $R^{\text{agg}}(s,a,\alpha \mid \eta)$ that incorporates tail-risk penalties via the $\alpha$-quantile, enables per-metric risk tuning and safe operation under stochastic constraints. The approach, named Risk-Aware Neural Contextual Bandit (RANCB), supports continuous actions and can adjust the risk-performance trade-off through $\alpha$, potentially per constraint, to meet application requirements. Empirical results in synthetic environments and a 5G resource-allocation use case show improved constraint satisfaction with controllable performance costs; SafeOPT serves as a relevant baseline but suffers from computational and practical limitations in real-time settings.
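To make the quantile-regression idea concrete, the sketch below shows the standard pinball loss, whose minimizer over a constant estimate is an empirical quantile of the data. It is only an illustration of the general technique: the paper's critics are neural networks, and the exact loss variant they use is not stated in this summary. The function names and the toy data are assumptions made for the example.

```python
def pinball_loss(y: float, q: float, tau: float) -> float:
    """Standard pinball (quantile) loss for one observation y and estimate q."""
    diff = y - q
    # Under-estimates (diff > 0) cost tau per unit; over-estimates cost (1 - tau).
    return max(tau * diff, (tau - 1.0) * diff)


def mean_pinball_loss(samples: list[float], q: float, tau: float) -> float:
    """Average pinball loss of a constant estimate q over all samples."""
    return sum(pinball_loss(y, q, tau) for y in samples) / len(samples)


# Hypothetical observations of a single metric (toy data, not from the paper).
samples = [float(v) for v in range(11)]  # 0.0, 1.0, ..., 10.0
tau = 0.9  # target quantile level

# Brute-force search: the candidate with the lowest loss is the 0.9-quantile.
best_q = min(samples, key=lambda q: mean_pinball_loss(samples, q, tau))
print(best_q)  # prints 9.0
```

With `tau` close to 1, over-estimating is cheap and under-estimating is expensive, so the estimate is pushed toward the upper tail of the distribution. This is the kind of pessimistic estimate a risk-aware method can use for a constraint metric.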

Abstract

Recent advances in learning techniques have garnered attention for their applicability to a diverse range of real-world sequential decision-making problems. Yet, many practical applications have critical constraints for operation in real environments. Most learning solutions neglect the risk of failing to meet these constraints, hindering their implementation in real-world contexts. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems, accommodating constraints and continuous action spaces. Our approach employs an actor-multi-critic architecture, with each critic characterizing the distribution of a performance or constraint metric. Our framework is designed to cater to various risk levels, effectively balancing constraint satisfaction against performance. To demonstrate the effectiveness of our approach, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. Finally, we evaluate our framework in a real-world use case involving a 5G mobile network where only our approach consistently satisfies the system constraint (a signal processing reliability target) with a small performance toll (8.5% increase in power consumption).

Paper Structure (20 sections, 12 equations, 13 figures, 1 algorithm)

Figures (13)

  • Figure 1: Risk-aware decision-making framework comprising a deterministic actor, $M+1$ distributional critics and the aggregation function detailed in eq. \ref{eq:ragg}. The propagation of the gradient to train the actor is shown in green.
  • Figure 2: Representation of the synthetic environment defined in eq. \ref{eq:syn_env} for a fixed context $s=(0.7, 0.7, 0.7)$ and $\sigma_{\text{env}} = 0.15$. We depict the 84.1$^{\text{th}}$, 97.7$^{\text{th}}$, and 99.9$^{\text{th}}$ quantiles of the functions with different transparency levels and their corresponding sets of feasible actions. The markers show the optimal values of the unconstrained (grey) and constrained (black) problems.
  • Figure 3: Evaluation of training phase in synthetic environment with $\sigma_{\text{env}} = 0.2$. Accumulated constraint violation $\Gamma_t$ (left); instantaneous reward $-r_t(s_t, a_t)$ (right).
  • Figure 4: Evaluation of execution time on an Intel i7-11700 @ 2.5 GHz with 15 GB of RAM.
  • Figure 5: Evaluation of inference performance. Average constraint violation per step as a function of $\sigma_{\text{env}}$.
  • ...and 8 more figures
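
The quantile levels quoted in the Figure 2 caption (84.1, 97.7 and 99.9) coincide with the standard normal CDF evaluated at one, two and three standard deviations above the mean. The short check below reproduces those numbers and, under the assumption that the environment noise is Gaussian with standard deviation $\sigma_{\text{env}}$ (which this summary does not state), shows how far above the mean each quantile would sit for $\sigma_{\text{env}} = 0.15$.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
sigma_env = 0.15           # noise level quoted in the Figure 2 caption

levels = []
for k in (1, 2, 3):
    # Percentile reached at k standard deviations above the mean.
    percentile = round(100 * std_normal.cdf(k), 1)
    levels.append(percentile)
    # Offset of that quantile from the mean, ASSUMING Gaussian noise.
    offset = k * sigma_env
    print(f"{k} sigma -> {percentile}th percentile, offset {offset:.2f}")

print(levels)  # prints [84.1, 97.7, 99.9]
```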