Risk-Sensitive Reinforcement Learning with Exponential Criteria

Erfaun Noorani, Christos Mavridis, John Baras

TL;DR

This work addresses robustness in reinforcement learning under model perturbations by formulating a risk-sensitive objective using exponential criteria. It develops two online methods, a model-free risk-sensitive REINFORCE variant and a risk-sensitive online Actor-Critic (R-AC) built on a multiplicative Bellman equation, with convergence and implementation insights. Empirical results on Cart-Pole and Acrobot show that choosing $\beta<0$ (risk-averse) or $\beta>0$ (risk-seeking) can reduce tail risk and improve robustness and sample efficiency, while maintaining competitive mean performance. Overall, exponential criteria offer a principled route to robust, more reliable RL in noisy or uncertain environments.

Abstract

While reinforcement learning has shown experimental success in a number of applications, it is known to be sensitive to noise and perturbations in the parameters of the system, leading to high variance in the total reward amongst different episodes in slightly different environments. To introduce robustness, as well as sample efficiency, risk-sensitive reinforcement learning methods are being thoroughly studied. In this work, we provide a definition of robust reinforcement learning policies and formulate a risk-sensitive reinforcement learning problem to approximate them, by solving an optimization problem with respect to a modified objective based on exponential criteria. In particular, we study a model-free risk-sensitive variation of the widely used Monte Carlo Policy Gradient algorithm and introduce a novel risk-sensitive online Actor-Critic algorithm based on solving a multiplicative Bellman equation using stochastic approximation updates. Analytical results suggest that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches, improves sample efficiency, and introduces robustness with respect to perturbations in the model parameters and the environment. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
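
To illustrate how an exponential criterion can relate to variance-style regularization, one can expand it for small $\beta$. This is a sketch under an assumption: it takes the criterion to have the standard form $J_\beta=\frac{1}{\beta}\log\mathbb{E}\left[e^{\beta R}\right]$ for a total reward $R$ with finite moments. The symbols $J_\beta$ and $R$ are notation chosen here and are not quoted from the paper. A second-order cumulant expansion then gives

$$
J_\beta \;=\; \frac{1}{\beta}\log\mathbb{E}\left[e^{\beta R}\right] \;=\; \mathbb{E}[R] \;+\; \frac{\beta}{2}\,\mathrm{Var}[R] \;+\; O(\beta^{2}).
$$

Under this form, $\beta<0$ penalizes the variance of the total reward (risk-averse) and $\beta>0$ rewards it (risk-seeking), which matches the sign convention used in the TL;DR.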

Paper Structure

This paper contains 18 sections, 5 theorems, 59 equations, 10 figures, 2 algorithms.

Key Result

Theorem 1

Consider a measurable space $(\Omega,\mathcal{F})$, where $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$. Let $\mathcal{P}(\Omega)$ be a set of probability measures $P:\mathcal{F}\rightarrow [0,1]$, and $P_\mu,P_\nu\in\mathcal{P}(\Omega)$. In addition, consider a bounded measurable function $Z:\Omega\rightarrow\mathbb{R}$. [...] and the KL divergence measure [...] are in duality with respect to a Legendre-type transform, in the following sense: [...]
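
The statement of Theorem 1 is cut off here, so its exact form and sign convention are not available. As an assumption, the sketch below uses the standard variational identity between the log-moment generating function and the KL divergence: for $\beta>0$, $\frac{1}{\beta}\log\mathbb{E}_{P_\mu}[e^{\beta Z}]=\sup_{P_\nu}\{\mathbb{E}_{P_\nu}[Z]-\frac{1}{\beta}D_{KL}(P_\nu\|P_\mu)\}$, with the supremum replaced by an infimum for $\beta<0$. It checks this numerically on a small finite sample space. All function names are hypothetical and the values of $Z$ and $P_\mu$ are randomly generated for illustration.

```python
import numpy as np

def free_energy(z, mu, beta):
    """(1/beta) * log E_mu[exp(beta * Z)] on a finite sample space."""
    return float(np.log(np.sum(mu * np.exp(beta * z))) / beta)

def kl(nu, mu):
    """KL(nu || mu) for strictly positive probability vectors."""
    return float(np.sum(nu * np.log(nu / mu)))

def variational_value(z, mu, nu, beta):
    """E_nu[Z] - (1/beta) * KL(nu || mu), the quantity being optimized over nu."""
    return float(np.sum(nu * z)) - kl(nu, mu) / beta

def gibbs(z, mu, beta):
    """Exponentially tilted measure nu* proportional to mu * exp(beta * Z)."""
    w = mu * np.exp(beta * z)
    return w / w.sum()

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=6)   # illustrative bounded function Z
mu = rng.dirichlet(np.ones(6))       # illustrative reference measure P_mu

results = {}
for beta in (-0.5, 0.5):
    fe = free_energy(z, mu, beta)
    # The tilted measure attains the optimum for either sign of beta.
    opt = variational_value(z, mu, gibbs(z, mu, beta), beta)
    # Random candidate measures should never beat the optimum.
    others = [variational_value(z, mu, rng.dirichlet(np.ones(6)), beta)
              for _ in range(1000)]
    results[beta] = (fe, opt, min(others), max(others))
    print(f"beta={beta:+.1f}  free energy={fe:.6f}  value at tilted measure={opt:.6f}")
```

In this form, a negative $\beta$ turns the free energy into a worst case over nearby measures $P_\nu$, with the KL term limiting how far $P_\nu$ may move from $P_\mu$. That is one way to read the link between risk aversion and robustness to model perturbations.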

Figures (10)

  • Figure 1: Generalization performance with respect to perturbations in the model parameters. Risk-neutral (left) and risk-sensitive (right) actor-critic reinforcement learning algorithms trained in the Cart-Pole environment with pole length $l=0.5$ are tested for different pole length values $l\in\left[ 0.2,0.8\right]$. Average reward and $90\%$ confidence intervals over a running window of $10$ episodes are depicted.
  • Figure 2: Training and testing behavior of the risk-neutral REINFORCE algorithm against the proposed risk-sensitive R-REINFORCE algorithm for $\beta=-0.1$ and $\beta=+0.1$ in the Cart-Pole problem. Average reward, CVaR$_{0.1}$, and CVaR$_{0.9}$ values (for $l=0.5$) are computed over $10$ independent training and testing runs with different random seeds.
  • Figure 3: Robustness of risk-neutral REINFORCE and risk-sensitive R-REINFORCE algorithms in the Cart-Pole environment with respect to varying pole length. The training environment is modeled with pole length $l=0.5$. The testing environments have perturbed pole length values of $l\in\left[ 0.2,0.8\right]$. Average reward, CVaR$_{0.1}$, and CVaR$_{0.9}$ values are computed over $10$ independent training and testing runs with different random seeds.
  • Figure 4: Training and testing behavior of the risk-neutral Online Actor-Critic (OAC) algorithm against the proposed risk-sensitive R-AC algorithm for $\beta=-0.001$ and $\beta=+0.005$ in the Cart-Pole problem. Average reward, CVaR$_{0.1}$, and CVaR$_{0.9}$ values (for $l=0.5$) are computed over $10$ independent training and testing runs with different random seeds.
  • Figure 5: Robustness of risk-neutral Online Actor-Critic (OAC) and risk-sensitive R-AC algorithms in the Cart-Pole environment with respect to varying pole length. The training environment is modeled with pole length $l=0.5$. The testing environments have perturbed pole length values of $l\in\left[ 0.2,0.8\right]$. Average reward, CVaR$_{0.1}$, and CVaR$_{0.9}$ values are computed over $10$ independent training and testing runs with different random seeds.
  • ...and 5 more figures
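
The captions above report CVaR$_{0.1}$ and CVaR$_{0.9}$ of the reward but do not say which tail convention is used. As a minimal sketch, the code below assumes the lower-tail convention, in which CVaR$_\alpha$ is the mean of the worst $\alpha$-fraction of episode returns. The function name `lower_tail_cvar` is hypothetical and the sample returns are made up for illustration.

```python
import numpy as np

def lower_tail_cvar(returns, alpha):
    """Mean of the lowest alpha-fraction of returns (assumed lower-tail convention)."""
    r = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * r.size)))  # number of worst samples to average
    return float(r[:k].mean())

# Made-up returns from 10 runs: one failure and nine full-length episodes.
returns = [10, 200, 200, 200, 200, 200, 200, 200, 200, 200]

mean_return = float(np.mean(returns))
cvar_10 = lower_tail_cvar(returns, 0.1)   # averages only the single worst run
cvar_90 = lower_tail_cvar(returns, 0.9)   # averages the nine lowest runs

print(mean_return, cvar_10, cvar_90)
```

With these numbers the mean stays high while the assumed CVaR$_{0.1}$ equals the single failed run, which is why a tail statistic can expose rare failures that the average reward hides.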

Theorems & Definitions (13)

  • Theorem 1
  • proof
  • Corollary 1.1
  • Remark 1
  • Definition 1
  • Theorem 2
  • proof
  • Remark 2
  • Theorem 3: Noorani2021RR
  • Corollary 3.1
  • ...and 3 more