Risk-Sensitive Soft Actor-Critic for Robust Deep Reinforcement Learning under Distribution Shifts

Tobias Enders, James Harrison, Maximilian Schiffer

TL;DR

The paper tackles robustness of deep reinforcement learning under distribution shifts in contextual multi-stage stochastic problems by introducing a discrete-action risk-sensitive SAC based on the entropic risk measure. It derives a Bellman equation for $\overline Q^\pi = e^{\beta Q^\pi}$, establishes a policy-improvement result, and implements a practical off-policy algorithm that learns from single trajectories while allowing explicit control over the consistency-robustness tradeoff via $\beta$ and $\alpha$. Empirical evaluation in a grid-world–style environment shows that the risk-sensitive approach improves robustness to distribution shifts compared to risk-neutral SAC and two baselines (data manipulation and entropy regularization), with the best results achieved for moderate risk-aversion. The work provides a structured analysis of robustness under distribution shifts in contextual multi-stage optimization and releases code to enable replication and further development.

Abstract

We study the robustness of deep reinforcement learning algorithms against distribution shifts within contextual multi-stage stochastic combinatorial optimization problems from the operations research domain. In this context, risk-sensitive algorithms promise to learn robust policies. While this field is of general interest to the reinforcement learning community, most studies to date focus on theoretical results rather than real-world performance. With this work, we aim to bridge this gap by formally deriving a novel risk-sensitive deep reinforcement learning algorithm while providing numerical evidence for its efficacy. Specifically, we introduce discrete Soft Actor-Critic for the entropic risk measure by deriving a version of the Bellman equation for the respective Q-values. We establish a corresponding policy improvement result and infer a practical algorithm. We introduce an environment that represents typical contextual multi-stage stochastic combinatorial optimization problems and perform numerical experiments to empirically validate our algorithm's robustness against realistic distribution shifts, without compromising performance on the training distribution. We show that our algorithm is superior to risk-neutral Soft Actor-Critic as well as to two benchmark approaches for robust deep reinforcement learning. Thereby, we provide the first structured analysis of the robustness of reinforcement learning under distribution shifts in the realm of contextual multi-stage stochastic combinatorial optimization problems.
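
To make the entropic risk measure named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common convention $\frac{1}{\beta}\log \mathbb{E}[e^{\beta X}]$ for a random return $X$, under which negative $\beta$ corresponds to risk aversion (consistent with the negative $\beta$ values reported in the figure captions). The function name `entropic_risk` and the two sample return sets are hypothetical and chosen only for illustration.

```python
import numpy as np


def entropic_risk(returns, beta):
    """Entropic risk (1/beta) * log E[exp(beta * X)] estimated from samples.

    Assumed convention: beta < 0 is risk-averse, beta -> 0 recovers the mean.
    """
    returns = np.asarray(returns, dtype=float)
    if beta == 0.0:
        # Risk-neutral limit: the plain expectation.
        return float(returns.mean())
    z = beta * returns
    m = z.max()
    # log-mean-exp, shifted by the maximum for numerical stability
    log_mean_exp = m + np.log(np.mean(np.exp(z - m)))
    return float(log_mean_exp / beta)


# Two hypothetical return samples with the same mean (1.0) but different spread.
safe = [1.0, 1.0, 1.0, 1.0]
risky = [4.0, -2.0, 4.0, -2.0]

for beta in (0.0, -1.0, -2.0):
    print(beta, entropic_risk(safe, beta), entropic_risk(risky, beta))
```

Under this convention, the constant returns keep the same value for every $\beta$, while the high-variance returns are valued below their mean, and increasingly so as $\beta$ becomes more negative. This is the sense in which $\beta$ controls the degree of risk aversion.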

Paper Structure

This paper contains 30 sections, 2 theorems, 26 equations, 8 figures, 1 algorithm.

Key Result

Proposition 1

For the risk-sensitive objective in Equation eq:objective, $\gamma$ close to one, and with $\overline Q$ as defined in Equation eq:Qbar, it holds that

Figures (8)

  • Figure 1: Illustration of the environment. The gray cell is the target location.
  • Figure 2: Per-time-step probability that an item appears in the respective cell, for twelve different item distributions.
  • Figure 3: Convergence behavior of our risk-sensitive algorithm with $\beta=-1$ compared to risk-neutral SAC. For each algorithm, we show the training curves for three different random seeds. The non-transparent lines correspond to the best-performing seed, the transparent ones to the other seeds.
  • Figure 4: Performance of the three approaches to improve robustness. The top row shows their performance on the test data for the training distribution. The bottom row shows the average over all other distributions, i.e., the performance under distribution shifts. The left-most data point in each plot shows the results for the non-robust discrete SAC algorithm. We report all results relative to SAC's performance when trained on the shifted, i.e., the true, distribution, as explained in the section on performance evaluation.
  • Figure 5: Consistency-robustness tradeoff for entropy regularization with $\alpha=0.05$ and our risk-sensitive algorithm with $\beta=-2$. Using the performance under distribution shifts vs. non-robust SAC trained on the shifted distributions as the performance metric, we show the weighted average of the performance on the training distribution and the performance under distribution shifts.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Proposition 1: Bellman equation
  • proof
  • Proposition 2: Policy improvement
  • proof