Risk-Aware Continuous Control with Neural Contextual Bandits
Jose A. Ayala-Romero, Andres Garcia-Saavedra, Xavier Costa-Perez
TL;DR
This work tackles risk-aware decision-making in constrained contextual bandits with continuous actions. It introduces an actor multi-critic framework in which each critic models the distribution of one metric (the reward or a constraint) using quantile regression. A deterministic actor, guided by an aggregated reward R^{agg}(s,a,\alpha|\eta) that incorporates tail-risk penalties via the \alpha-quantile, enables per-metric risk tuning and safe operation under stochastic constraints. The approach, named Risk-Aware Neural Contextual Bandit (RANCB), adjusts the risk-performance trade-off through \alpha, potentially per constraint, to meet application requirements. Empirical results in synthetic environments and a 5G resource-allocation use case show improved constraint satisfaction at a controllable performance cost; SafeOPT serves as a relevant baseline but suffers from computational and practical limitations in real-time settings.
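To make the role of quantile regression concrete, the following is a minimal sketch of the standard pinball (quantile) loss, whose minimizer is the \alpha-quantile of the target distribution. It is not the paper's critic implementation: the names `pinball_loss` and `fit_quantile`, the Gaussian samples, and the grid search (a stand-in for gradient-based training of a neural critic) are all illustrative assumptions.

```python
import numpy as np

def pinball_loss(y, q, alpha):
    """Mean pinball loss of a scalar prediction q for samples y at level alpha."""
    diff = y - q
    # Under-prediction (diff > 0) is weighted by alpha,
    # over-prediction (diff < 0) by (1 - alpha).
    return np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff))

def fit_quantile(y, alpha, grid):
    """Hypothetical helper: pick the grid point with the lowest pinball loss."""
    losses = [pinball_loss(y, q, alpha) for q in grid]
    return grid[int(np.argmin(losses))]

# Assumed toy data: noisy observations of one metric for a fixed (context, action).
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=20_000)

grid = np.linspace(-1.0, 3.0, 401)         # candidate quantile values, step 0.01
q_hat = fit_quantile(samples, 0.9, grid)   # estimate of the 0.9-quantile
```

Up to the grid resolution, `q_hat` agrees with `np.quantile(samples, 0.9)`, which is why a critic trained with this loss at level \alpha can serve as an estimate of the tail of a metric.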
Abstract
Recent advances in learning techniques have garnered attention for their applicability to a diverse range of real-world sequential decision-making problems. Yet, many practical applications have critical constraints for operation in real environments. Most learning solutions neglect the risk of failing to meet these constraints, hindering their implementation in real-world contexts. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems, accommodating constraints and continuous action spaces. Our approach employs an actor multi-critic architecture, with each critic characterizing the distribution of performance and constraint metrics. Our framework is designed to cater to various risk levels, effectively balancing constraint satisfaction against performance. To demonstrate the effectiveness of our approach, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. We then evaluate our framework in a real-world use case involving a 5G mobile network, where only our approach consistently satisfies the system constraint (a signal processing reliability target) with a small performance toll (8.5% increase in power consumption).
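The trade-off between constraint satisfaction and performance can be illustrated with a toy example. The sketch below is not the paper's algorithm and does not use its aggregated reward: it assumes a hypothetical one-dimensional problem in which the reward equals the action, the constraint cost is the action plus Gaussian noise, and an action is accepted only if the empirical \alpha-quantile of its cost stays below an assumed threshold `C_MAX`. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

C_MAX = 1.0        # assumed constraint threshold
NOISE_STD = 0.2    # assumed intrinsic environmental noise
actions = np.linspace(0.0, 1.5, 151)   # candidate actions; reward is the action itself

def cost_samples(a, n=5000):
    """Toy constraint metric: the action plus zero-mean Gaussian noise."""
    return a + rng.normal(0.0, NOISE_STD, size=n)

def select_action(alpha):
    """Largest action whose empirical alpha-quantile of cost is within C_MAX."""
    feasible = [a for a in actions if np.quantile(cost_samples(a), alpha) <= C_MAX]
    return max(feasible)

def violation_rate(a, n=20_000):
    """Fraction of fresh samples in which the cost exceeds C_MAX."""
    return float(np.mean(cost_samples(a, n) > C_MAX))

a_neutral = select_action(0.5)    # risk-neutral: constrain the median cost
a_averse = select_action(0.95)    # risk-averse: constrain the 0.95-quantile

print(a_neutral, violation_rate(a_neutral))
print(a_averse, violation_rate(a_averse))
```

In this toy setting, raising \alpha yields a smaller (lower-reward) action with a much lower violation rate, which mirrors the qualitative pattern the abstract describes: better constraint satisfaction in exchange for a performance toll.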
