Risk-Averse Reinforcement Learning with Itakura-Saito Loss

Igor Udovichenko, Olivier Croissant, Anita Toleutaeva, Evgeny Burnaev, Alexander Korotin

TL;DR

The paper tackles risk-averse reinforcement learning by leveraging exponential utility and introduces a numerically stable Itakura-Saito loss for learning risk-sensitive value functions. Grounded in Bregman divergence, the loss yields a risk-averse Bellman update and a stochastic approximation rule, enabling stable training where prior exponential-style losses struggle. Empirical results across analytically tractable portfolio problems, Deep Hedging, and robust combinatorial optimization show that Itakura-Saito loss often outperforms alternatives in stability and accuracy, especially at larger risk aversion levels. The work also offers a theoretical perspective linking the loss to conformal invariance and improved optimization conditioning, suggesting broad implications for robust learning under uncertainty.

Abstract

Risk-averse reinforcement learning finds application in various high-stakes fields. Unlike classical reinforcement learning, which aims to maximize expected returns, risk-averse agents choose policies that minimize risk, occasionally sacrificing expected value. These preferences can be framed through utility theory. We focus on the specific case of the exponential utility function, where one can derive the Bellman equations and employ various reinforcement learning algorithms with few modifications. To address the numerical instability of existing exponential-utility losses, we introduce to the broad machine learning community a numerically stable and mathematically sound loss function based on the Itakura-Saito divergence for learning state-value and action-value functions. We evaluate the Itakura-Saito loss function against established alternatives, both theoretically and empirically. In the experimental section, we explore multiple scenarios, some with known analytical solutions, and show that the considered loss function outperforms the alternatives.
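To make the idea concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes the loss is the Itakura-Saito divergence $d_{IS}(p, q) = p/q - \log(p/q) - 1$ applied to $p = e^{-\alpha R}$ and $q = e^{-\alpha v}$ for a sampled return $R$ and a scalar estimate $v$. The names `is_loss` and `certainty_equivalent`, the toy Gaussian returns, and the grid search are all illustrative assumptions. Under that assumed form, the grid minimizer should land on the sample certainty equivalent, and the loss depends only on the difference $v - R$.

```python
import numpy as np

def certainty_equivalent(returns, alpha):
    """Sample CE under exponential utility: -(1/alpha) * log mean(exp(-alpha * R))."""
    z = -alpha * np.asarray(returns, dtype=float)
    m = z.max()  # subtract the max before exponentiating (log-sum-exp trick)
    return -(m + np.log(np.mean(np.exp(z - m)))) / alpha

def is_loss(v, returns, alpha):
    """Assumed IS loss: d_IS(exp(-alpha*R), exp(-alpha*v)), averaged over samples."""
    d = alpha * (v - np.asarray(returns, dtype=float))
    return float(np.mean(np.exp(d) - d - 1.0))

# Toy data (hypothetical): Gaussian returns with mean 1.0 and std 0.5.
rng = np.random.default_rng(0)
alpha = 2.0
returns = rng.normal(loc=1.0, scale=0.5, size=5000)

# Brute-force minimization over a grid of candidate values v.
grid = np.linspace(0.0, 1.5, 1501)
losses = np.array([is_loss(v, returns, alpha) for v in grid])
v_best = float(grid[np.argmin(losses)])
ce = float(certainty_equivalent(returns, alpha))

# Shifting both the estimate and the returns by a constant leaves the loss
# unchanged up to rounding, because only the difference v - R enters.
shift = 100.0
loss_shifted = is_loss(v_best + shift, returns + shift, alpha)

print(f"grid minimizer: {v_best:.3f}, sample CE: {ce:.3f}")
```

For Gaussian returns the certainty equivalent has the closed form $\mu - \alpha\sigma^2/2$, which is 0.75 for these toy parameters, so both printed numbers should sit near that value. A grid search is used only to keep the sketch dependency-free; in practice the loss would be minimized by gradient steps on a parametric value function.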

Paper Structure

This paper contains 36 sections, 3 theorems, 56 equations, 4 figures, 1 table.

Key Result

Proposition 1

Under mild assumptions, the value function that minimizes the Itakura-Saito loss (eq:is_loss) satisfies eq:evv.
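The equations referenced here are not reproduced in this summary. As a hedged one-step sketch of why a result of this kind is plausible, assume the loss is the Itakura-Saito divergence between $e^{-\alpha R}$ and $e^{-\alpha v}$ for a random return $R$, a scalar estimate $v$, and risk aversion $\alpha > 0$ (the symbols $\mathcal{L}$, $R$, $v$, and $v^*$ are illustrative and need not match the paper's notation). The first-order condition then gives the exponential-utility certainty equivalent:

```latex
\mathcal{L}(v) = \mathbb{E}\left[ e^{\alpha (v - R)} - \alpha (v - R) - 1 \right],
\qquad
\frac{d\mathcal{L}}{dv}
  = \alpha \left( e^{\alpha v}\, \mathbb{E}\left[ e^{-\alpha R} \right] - 1 \right) = 0
\;\Longrightarrow\;
v^* = -\frac{1}{\alpha} \log \mathbb{E}\left[ e^{-\alpha R} \right].
```

Since $\mathcal{L}$ is strictly convex in $v$ under this assumed form, the stationary point is the unique minimizer whenever $\mathbb{E}[e^{-\alpha R}]$ is finite.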

Figures (4)

  • Figure 1: Comparison of loss penalties for a one-step value prediction error $\delta_{\tilde{V}}(\theta)$ when $\alpha=1$. A positive $\delta_{\tilde{V}}(\theta) > 0$ means the current estimate $\tilde{V}_\theta(s)$ underestimates the true CE value (the return is higher than expected). Risk-averse losses heavily penalize underestimation ($\delta_{\tilde{V}}(\theta) > 0$) since underestimating the value implies unaccounted risk, whereas overestimation ($\delta_{\tilde{V}}(\theta)< 0$) is penalized less. MSE, being risk-neutral, is symmetric. EMSE (exponential MSE) grows with the absolute value of $V$, leading to numerical instability for large values.
  • Figure 2: Error of the learned approximation of $V^*$ in the Gaussian and quadratic cases. Each experiment was run five times with different random seeds. In the Gaussian case, the losses perform on par. Loss (eq:soft_plus) does not learn the correct value function for the non-Gaussian return.
  • Figure 3: Loss performance on the Deep Hedging problem [deep_hedging]. Loss (eq:is_loss) shows more stable and reliable convergence than the alternatives.
  • Figure 4: Loss performance on the RSSAC problem [enders2024risk]. Learning curves depict the mean validation return during the training process. Each line represents the average over three random seeds, with shaded areas indicating $\pm 1$ standard deviation. Loss (eq:exp_mse) destabilizes training.

Theorems & Definitions (5)

  • Proposition 1
  • Proposition
  • proof
  • Theorem: Improved Spectral Conditioning under Conformal Invariance
  • proof: Sketch of proof