On the Theory of Risk-Aware Agents: Bridging Actor-Critic and Economics

Michal Nauman, Marek Cygan

TL;DR

This work grounds risk-aware reinforcement learning in decision theory by showing that pessimistic actor-critic updates approximate expected utility maximization under an exponential utility, which gives a unified interpretation of risk-neutral and risk-aware policies through certainty equivalents. It introduces Dual Actor-Critic (DAC), a model-free algorithm with a pessimistic actor for TD learning and an optimistic actor for exploration, with the divergence between the two actors controlled online via a KL penalty. Empirically, DAC achieves significant gains in sample efficiency and final performance across 30 locomotion and manipulation tasks, and matches or exceeds model-based methods in the challenging dog and humanoid domains while incurring manageable compute overhead. The results support a practical, theory-grounded approach to balancing exploration and value estimation under uncertainty, and the paper complements them with ablations, robustness analyses, and a discussion of broader impacts.
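
The online divergence control mentioned above can be illustrated with a minimal sketch. It assumes diagonal Gaussian policies and a generic multiplicative update that pushes the penalty weight toward a target KL. The names `gaussian_kl`, `update_kl_weight`, `target_kl`, and `step_size` are hypothetical, and neither the update rule nor the direction of the KL is taken from the paper.

```python
import math


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between two diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        # Closed-form KL divergence between two univariate Gaussians.
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def update_kl_weight(weight, empirical_kl, target_kl, step_size=0.1):
    """Hypothetical rule: raise the weight when the actors drift apart, lower it otherwise."""
    # A multiplicative step keeps the weight positive.
    return weight * math.exp(step_size * (empirical_kl - target_kl))


# Illustrative values only: an optimistic actor shifted away from a pessimistic one.
kl = gaussian_kl([0.3, -0.1], [0.5, 0.5], [0.0, 0.0], [0.4, 0.4])
weight = update_kl_weight(weight=1.0, empirical_kl=kl, target_kl=0.25)
```

With a rule of this kind, the penalty weight grows while the empirical KL stays above the target and shrinks once it falls below, so the same target can be reused across tasks with different Q-value scales.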

Abstract

Risk-aware Reinforcement Learning (RL) algorithms like SAC and TD3 have been shown empirically to outperform their risk-neutral counterparts in a variety of continuous-action tasks. However, the theoretical basis for the pessimistic objectives these algorithms employ remains unestablished, raising questions about the specific class of policies they implement. In this work, we apply the expected utility hypothesis, a fundamental concept in economics, to show that both risk-neutral and risk-aware RL objectives can be interpreted as expected utility maximization with an exponential utility function. This approach reveals that risk-aware policies effectively maximize the value certainty equivalent, aligning them with conventional decision theory principles. Furthermore, we propose Dual Actor-Critic (DAC), a risk-aware, model-free algorithm that features two distinct actor networks: a pessimistic actor for temporal-difference learning and an optimistic actor for exploration. Our evaluations of DAC across various locomotion and manipulation tasks demonstrate improvements in sample efficiency and final performance. Remarkably, DAC, while requiring significantly fewer computational resources, matches the performance of leading model-based methods in the complex dog and humanoid domains.
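
To make the pessimistic objective concrete: SAC and TD3 take the minimum of two critic estimates when forming the TD target, and for two critics that minimum equals the ensemble mean minus the critic disagreement. The sketch below checks this identity and shows the mirrored optimistic bound. The function names and the sign convention (negative `beta` for pessimism, positive for optimism) are assumptions made for illustration, not definitions from the paper.

```python
def ensemble_stats(q1, q2):
    """Mean and disagreement of two critic estimates for one state-action pair."""
    mean = 0.5 * (q1 + q2)
    # Half the absolute difference equals the population std of two values.
    disagreement = 0.5 * abs(q1 - q2)
    return mean, disagreement


def risk_adjusted_value(q1, q2, beta):
    """Mean plus beta times disagreement; beta < 0 is pessimistic, beta > 0 optimistic."""
    mean, disagreement = ensemble_stats(q1, q2)
    return mean + beta * disagreement


q1, q2 = 1.0, 3.0
pessimistic = risk_adjusted_value(q1, q2, beta=-1.0)  # coincides with min(q1, q2)
neutral = risk_adjusted_value(q1, q2, beta=0.0)       # coincides with the ensemble mean
optimistic = risk_adjusted_value(q1, q2, beta=1.0)    # coincides with max(q1, q2)
```

Under this convention, a pessimistic actor for TD learning and an optimistic actor for exploration differ only in the sign and magnitude of `beta` applied to the same pair of critics.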

Paper Structure

This paper contains 40 sections, 1 theorem, 25 equations, 17 figures, and 3 tables.

Key Result

Theorem 3.1

Let the exponential utility be $\mathcal{U}(V_{i}, \beta) = e^{2\beta V_{i}(s)}$. Then the certainty equivalent value $V^{c}(s)$ is approximately equal to the pessimistic value $V^{\beta}(s)$, that is, $V^{c}(s) \approx V^{\beta}(s)$.
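
A small numerical check can make the certainty equivalent tangible. The sketch below computes $\mathcal{U}^{-1}(\mathbb{E}[\mathcal{U}(V_i, \beta)])$ for an equally weighted ensemble of value estimates and compares it with the standard second-order expansion, mean plus $\beta$ times variance. This excerpt does not give the definition of $V^{\beta}(s)$, so treating that expansion as the comparison point is an assumption, and the function names are hypothetical.

```python
import math


def certainty_equivalent(values, beta):
    """U^{-1}(E[U(V)]) for U(V) = exp(2 * beta * V) over an equally weighted ensemble."""
    if beta == 0.0:
        return sum(values) / len(values)  # risk-neutral limit as beta -> 0
    z = [2.0 * beta * v for v in values]
    m = max(z)
    # Log-mean-exp with the maximum subtracted for numerical stability.
    log_mean_exp = m + math.log(sum(math.exp(x - m) for x in z) / len(z))
    # Inverting exp(2 * beta * V) gives log(.) / (2 * beta).
    return log_mean_exp / (2.0 * beta)


def mean_variance_approx(values, beta):
    """Second-order expansion of the certainty equivalent: mean + beta * variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean + beta * var


values = [1.0, 1.2]  # illustrative estimates from two critics
ce_pessimistic = certainty_equivalent(values, beta=-0.5)
ce_optimistic = certainty_equivalent(values, beta=0.5)
approx = mean_variance_approx(values, beta=-0.5)
```

For a negative `beta` the certainty equivalent falls below the ensemble mean, and for a positive `beta` it rises above it, which is the sense in which a single exponential utility covers both pessimistic and optimistic values.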

Figures (17)

  • Figure 1: We test the proposed approach (DAC) against various risk-aware and risk-neutral actor-critic baselines in 30 tasks listed in Table \ref{tab:all_tasks}. Due to the controlled experimental setup described in Section \ref{section:experiments}, the performance differences between algorithms stem solely from their risk management. The $Y$-axis reports IQM with 95% CI calculated using 10 seeds, with 1.0 representing the maximal score.
  • Figure 2: We report the size of critic disagreement relative to the critic means (\ref{fig:adjust1}), optimism (\ref{fig:adjust2}), the KL penalty weight (\ref{fig:adjust3}), and the empirical KL between the two actors (\ref{fig:adjust4}) during 1M environment steps of training on the tasks listed in Table \ref{tab:all_tasks}. The DAC adjustment mechanism accommodates different scales of Q-values and maintains a predefined KL divergence between the two policies despite significant differences between tasks. 30 tasks, 10 seeds per task.
  • Figure 3: We report final IQM in 30 tasks (Table \ref{tab:all_tasks}) in low ($RR=2$) and high ($RR=16$) replay ratio settings. 1.0 denotes the maximal possible score; 95% CI calculated with 10 random seeds.
  • Figure 4: We compare model-free DAC and model-free SR-SAC to model-based TD-MPC. The $Y$-axis reports IQM, and the $X$-axis denotes environment steps. 3M environment steps, 5 seeds.
  • Figure 5: We evaluate DAC performance when changing the values of its hyperparameters ($X$-axis) for two replay ratio regimes. The bold value denotes the value used in the main experiment. The $Y$-axis reports the percentage improvement over tuned SAC. 15 tasks, 500k steps, 10 seeds per task.
  • ...and 12 more figures
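
The captions above report the interquartile mean (IQM) with a 95% CI. As a minimal sketch of that metric, the code below averages the middle 50% of sorted scores and estimates an interval with a percentile bootstrap. The captions do not specify the CI procedure, so the percentile bootstrap, the function names, and the sample scores are assumptions for illustration only.

```python
import random


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    k = int(0.25 * len(ordered))  # number of scores dropped from each tail
    middle = ordered[k:len(ordered) - k]
    return sum(middle) / len(middle)


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the IQM (assumed procedure)."""
    rng = random.Random(seed)
    # Resample with replacement and recompute the IQM each time.
    stats = sorted(
        iqm(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = stats[int((1.0 - level) / 2.0 * n_resamples)]
    hi = stats[int((1.0 + level) / 2.0 * n_resamples) - 1]
    return lo, hi


# Hypothetical normalized scores from 10 seeds, with 1.0 as the maximal score.
seed_scores = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.69, 0.77, 0.30]
point_estimate = iqm(seed_scores)
ci_low, ci_high = bootstrap_ci(seed_scores)
```

Because the lowest and highest quarters are discarded, a single failed seed such as the 0.30 above has no direct effect on the point estimate, which is one reason the IQM is favored when only a handful of seeds is available.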

Theorems & Definitions (1)

  • Theorem 3.1