Bootstrapping Expectiles in Reinforcement Learning

Pierre Clavier, Emmanuel Rachelson, Erwan Le Pennec, Matthieu Geist

TL;DR

The paper addresses overestimation and robustness in reinforcement learning by replacing the standard Bellman expectation with an $\alpha$-expectile, yielding a pessimistic yet contraction-preserving update via the Expectile Bellman Operator. It introduces ExpectRL, a single-critic alternative to the TD3 twin-critic approach, and extends it with Domain Randomization to form a viable robust RL framework; it also proposes AutoExpectRL, which uses bandit-based auto-tuning of $\alpha$. Empirical results in MuJoCo show ExpectRL often outperforms twin-critic baselines and that DR+ExpectRL approaches state-of-the-art robustness on several benchmarks, with AutoExpectRL offering near-parity performance without extra tuning. The work demonstrates a simple, principled route to risk-aware and robust RL that avoids extensive hyperparameter tuning and heavy resampling, with potential to extend to other RL algorithms and settings.

Abstract

Many classic Reinforcement Learning (RL) algorithms rely on a Bellman operator, which involves an expectation over the next states, leading to the concept of bootstrapping. To introduce a form of pessimism, we propose to replace this expectation with an expectile. In practice, this can be done very simply by replacing the $L_2$ loss with a more general expectile loss for the critic. Introducing pessimism in RL is desirable for various reasons, such as tackling the overestimation problem (for which classic solutions are double Q-learning or the twin-critic approach of TD3) or robust RL (where transitions are adversarial). We study these two cases empirically. For the overestimation problem, we show that the proposed approach, ExpectRL, provides better results than a classic twin-critic. On robust RL benchmarks, involving changes of the environment, we show that our approach is more robust than classic RL algorithms. We also introduce a variation of ExpectRL combined with domain randomization which is competitive with state-of-the-art robust RL agents. Finally, we extend ExpectRL with a mechanism for automatically choosing the expectile value, that is, the degree of pessimism.
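The replacement of the $L_2$ loss by an expectile loss can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `expectile_loss` and `expectile` are hypothetical, and the sign convention (weight $\alpha$ on targets above the prediction, $1-\alpha$ below, so that $\alpha < 1/2$ pulls the estimate downward) is an assumption based on the usual definition of expectiles and may differ from the paper's.

```python
import numpy as np

def expectile_loss(target, prediction, alpha):
    """Asymmetric squared loss (assumed convention, see lead-in).

    With alpha = 0.5 this is half the usual mean squared error.
    """
    diff = np.asarray(target, dtype=float) - prediction
    # Weight alpha where the target exceeds the prediction, 1 - alpha otherwise.
    weight = np.where(diff > 0, alpha, 1.0 - alpha)
    return float(np.mean(weight * diff ** 2))

def expectile(samples, alpha, iters=100):
    """Alpha-expectile of an empirical sample via fixed-point iteration.

    The first-order condition of the loss above gives
    m = sum(w_i * x_i) / sum(w_i), with weights w_i that depend on m.
    """
    x = np.asarray(samples, dtype=float)
    m = x.mean()
    for _ in range(iters):
        w = np.where(x > m, alpha, 1.0 - alpha)
        m = np.sum(w * x) / np.sum(w)
    return float(m)

# Hypothetical bootstrapped targets for one state-action pair.
targets = [0.0, 1.0, 2.0, 3.0, 10.0]
mean_estimate = expectile(targets, 0.5)         # coincides with the mean
pessimistic_estimate = expectile(targets, 0.1)  # lies below the mean
```

Under this convention, a critic trained by gradient descent on `expectile_loss` with $\alpha < 1/2$ regresses toward a value below the mean of its targets, which is the pessimism the abstract describes.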

Paper Structure (33 sections, 4 theorems, 25 equations, 8 figures, 6 tables, 1 algorithm)


Key Result

Theorem 4.1

The (optimal) Expectile Bellman Operators are $\gamma$-contractions for the sup norm (proof in the appendix).
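For orientation, the statement can be written out as follows. This is a sketch under assumptions, not a quotation from the paper: the symbols $m_\alpha$, $\mathcal{T}^{\pi}_{\alpha}$, $r$ and $P$ are notation chosen here, the expectile is taken in its standard form, and the operator is assumed to be the usual policy-evaluation Bellman operator with the expectation over next states replaced by an expectile.

```latex
% Standard alpha-expectile of a random variable X (assumed convention).
m_\alpha(X) \in \arg\min_{m \in \mathbb{R}}
  \mathbb{E}\left[ \left| \alpha - \mathbb{1}\{X < m\} \right| (X - m)^2 \right]

% Assumed form of the Expectile Bellman Operator for a policy \pi,
% with s' \sim P(\cdot \mid s, a) and a' \sim \pi(\cdot \mid s').
(\mathcal{T}^{\pi}_{\alpha} Q)(s, a)
  = r(s, a) + \gamma \, m_\alpha\!\left( \mathbb{E}_{a'}\left[ Q(s', a') \right] \right)

% The gamma-contraction property stated by the theorem.
\left\| \mathcal{T}^{\pi}_{\alpha} Q_1 - \mathcal{T}^{\pi}_{\alpha} Q_2 \right\|_\infty
  \le \gamma \left\| Q_1 - Q_2 \right\|_\infty
```

With $\alpha = 1/2$ the expectile is the mean, so the assumed operator reduces to the classic Bellman operator.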

Figures (8)

  • Figure 1: Mean performance as a function of the expectile, non-robust case (corresponding to Table \ref{nominal}).
  • Figure 2: Learning curves, non-robust case (corresponding to Table \ref{nominal}).
  • Figure 3: Min performance as a function of the expectile, robust case (corresponding to Table \ref{classical}).
  • Figure 4: Min performance as a function of the expectile, robust case (corresponding to Table \ref{classical}).
  • Figure 5: Min performance as a function of the expectile, robust case (corresponding to Table \ref{classical}).
  • ...and 3 more figures

Theorems & Definitions (5)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem A.1
  • Theorem A.2
  • proof