Extreme Q-Learning: MaxEnt RL without Entropy

Divyansh Garg, Joey Hejna, Matthieu Geist, Stefano Ermon

TL;DR

Extreme Q-Learning introduces a novel EVT-based framework to directly estimate the soft-optimal value function in MaxEnt RL without sampling from a policy. By modeling Gumbel-distributed errors in Bellman backups, it derives a Gumbel regression objective that yields LogSumExp values and a practical, entropy-free approach to MaxEnt RL applicable to online and offline settings. The method demonstrates strong offline performance on D4RL benchmarks (notably Franka Kitchen) and competitive online results on DM Control, while connecting soft-Q learning with conservative Q-learning through KL-based conservatism. Overall, XQL offers a simpler, principled alternative to policy-centric MaxEnt methods with robust performance gains across domains.

Abstract

Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance in the D4RL benchmark, outperforming prior works by 10+ points on the challenging Franka Kitchen tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks. Visualizations and code can be found on our website at https://div99.github.io/XQL/.

Paper Structure

This paper contains 35 sections, 11 theorems, 47 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

For i.i.d. random variables $X_1, \dots, X_n \sim f_X$ with exponential tails, $\lim_{n \rightarrow \infty} \max_{i}(X_i)$ follows the Gumbel (GEV-1) distribution $\mathcal{G}$. Furthermore, $\mathcal{G}$ is max-stable, i.e., if $X_i \sim \mathcal{G}$, then $\max_i(X_i) \sim \mathcal{G}$ holds.
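
The max-stability claim can be checked numerically. The sketch below is illustrative only and is not taken from the paper's code: it assumes the standard fact that the maximum of $n$ i.i.d. Gumbel variables with location $\mu$ and scale $\beta$ is again Gumbel with the same scale and location $\mu + \beta \log n$, and compares the empirical mean and standard deviation of simulated maxima with the values that fact predicts. The function names and parameter choices are hypothetical.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def max_of_gumbels(mu, beta, n, trials, seed=0):
    """Draw `trials` maxima, each taken over n i.i.d. Gumbel(mu, beta) samples."""
    rng = np.random.default_rng(seed)
    samples = rng.gumbel(loc=mu, scale=beta, size=(trials, n))
    return samples.max(axis=1)


def predicted_location(mu, beta, n):
    """Location of the max under max-stability: shifted by beta * log(n)."""
    return mu + beta * np.log(n)


# Illustrative parameters (assumptions, not values from the paper).
mu, beta, n = 0.0, 1.0, 50
maxima = max_of_gumbels(mu, beta, n, trials=20000)

# A Gumbel(loc, beta) variable has mean loc + gamma * beta
# and standard deviation beta * pi / sqrt(6).
predicted_mean = predicted_location(mu, beta, n) + EULER_GAMMA * beta
predicted_std = beta * np.pi / np.sqrt(6)

empirical_mean = maxima.mean()
empirical_std = maxima.std()

print(f"mean: empirical {empirical_mean:.3f} vs predicted {predicted_mean:.3f}")
print(f"std:  empirical {empirical_std:.3f} vs predicted {predicted_std:.3f}")
```

The scale of the maxima stays at $\beta$ while only the location moves, which is the sense in which the family is closed under taking maxima.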

Figures (7)

  • Figure 1: Bellman errors from SAC on Cheetah-Run (Tassa et al., 2018). The Gumbel distribution captures the skew better than the Gaussian. Plots for TD3 and more environments can be found in the experiments appendix.
  • Figure 2: Left: The pdf of the Gumbel distribution with $\mu = 0$ and different values of $\beta$. Center: Our Gumbel loss for different values of $\beta$. Right: Gumbel regression applied to a two-dimensional random variable for different values of $\beta$. The smaller the value of $\beta$, the more the regression fits the extrema.
  • Figure 3: Results on DM Control for SAC- and TD3-based versions of Extreme Q-Learning.
  • Figure 4: The effect of different ways of fitting the value function on a toy grid world, where the agent's goal is to navigate from the beginning of the maze on the bottom left to the end of the maze on the top left. The color of each square shows the learned value. As the environment is discrete, we can investigate how well Gumbel regression fits the maximum of the Q-values. When MSE loss is used instead of Gumbel regression, the resulting policy is poor at the beginning and the learned values fail to propagate. As we increase the value of $\beta$, the learned values begin to better approximate the optimal max-Q policy shown on the far right.
  • Figure 5: Additional plots of the error distributions of SAC for different environments. We find that the Gumbel distribution strongly fits the errors in the first two environments, Cheetah and Walker, but provides a worse fit in the Hopper environment. Nonetheless, we see performance improvements in Hopper using our approach.
  • ...and 2 more figures
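
The behavior described for Figure 2 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the Gumbel loss takes the form $\mathbb{E}[e^{(x-h)/\beta} - (x-h)/\beta - 1]$, whose minimizer under that assumption is the LogSumExp-style value $h^* = \beta \log \mathbb{E}[e^{x/\beta}]$. The helper names, the uniform toy data, the step size, and the choice to start descent from the sample maximum are all hypothetical.

```python
import numpy as np


def gumbel_loss(h, x, beta):
    """Assumed Gumbel regression loss: mean of exp(z) - z - 1, z = (x - h) / beta."""
    z = (x - h) / beta
    return np.mean(np.exp(z) - z - 1.0)


def soft_value_closed_form(x, beta):
    """beta * log(mean(exp(x / beta))), computed with a max shift for stability."""
    m = x.max()
    return m + beta * np.log(np.mean(np.exp((x - m) / beta)))


def fit_gumbel_regression(x, beta, lr=0.5, steps=500):
    """Minimize the loss over a scalar h by gradient descent.

    Starting from max(x) keeps exp(z) <= 1, so the updates stay well scaled.
    """
    h = float(x.max())
    for _ in range(steps):
        z = (x - h) / beta
        grad = np.mean(1.0 - np.exp(z)) / beta  # d(loss)/dh
        h -= lr * beta**2 * grad                # step rescaled by beta^2
    return h


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10000)  # toy targets, not data from the paper

fits = {beta: fit_gumbel_regression(x, beta) for beta in (2.0, 1.0, 0.5)}
closed = {beta: soft_value_closed_form(x, beta) for beta in fits}

for beta in fits:
    print(f"beta={beta}: fitted {fits[beta]:.4f}, closed form {closed[beta]:.4f}")
```

On this toy data the fitted value moves from near the sample mean toward the sample maximum as $\beta$ shrinks, consistent with the caption's remark that smaller $\beta$ fits the extrema more closely.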

Theorems & Definitions (15)

  • Theorem 1: Extreme Value Theorem (EVT) (Mood, 1950; Fisher and Tippett, 1928)
  • Lemma 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Lemma 3.4
  • Lemma 3.5
  • Lemma A.1
  • Corollary A.1.1
  • proof
  • Lemma B.1
  • ...and 5 more