Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning

Kwanyoung Park; Youngwoon Lee

TL;DR

This work addresses offline reinforcement learning with model-based data augmentation, where naive model rollouts can yield biased value estimates. It introduces Lower Expectile Q-learning (LEQ), which applies lower expectile regression ($\tau<0.5$) to multi-step $Q$-targets and optimizes both the critic and the policy with $\lambda$-returns from imagined trajectories to obtain conservative, low-bias estimates. Empirically, LEQ achieves strong performance on long-horizon AntMaze tasks and competitive results on MuJoCo locomotion and vision-based benchmarks, indicating robustness across diverse domains. Ablations confirm the importance of lower expectile regression, $\lambda$-returns, and critic training on offline data. The approach provides a practical, scalable alternative to uncertainty-based penalties and demonstrates the value of conservative learning signals in model-based offline RL.

Abstract

Model-based offline reinforcement learning (RL) is a compelling approach that addresses the challenge of learning from limited, static data by generating imaginary trajectories using learned models. However, these approaches often struggle with inaccurate value estimation from model rollouts. In this paper, we introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ), which provides a low-bias model-based value estimation via lower expectile regression of $\lambda$-returns. Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches and sequence modeling approaches. Furthermore, LEQ matches the performance of state-of-the-art model-based and model-free methods in dense-reward environments across both state-based tasks (NeoRL and D4RL) and pixel-based tasks (V-D4RL), showing that LEQ works robustly across diverse domains. Our ablation studies demonstrate that lower expectile regression, $\lambda$-returns, and critic training on offline data are all crucial for LEQ.
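To make "lower expectile" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual definition of the $\tau$-expectile as the minimizer of an asymmetric least-squares loss, and finds that minimizer by bisection on the loss gradient. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def expectile_loss(y, samples, tau):
    """Asymmetric least-squares loss; its minimizer over y is the tau-expectile."""
    total = 0.0
    for x in samples:
        # Weight |tau - 1(y > x)|: overestimates (y > x) get weight 1 - tau.
        weight = (1.0 - tau) if y > x else tau
        total += weight * (y - x) ** 2
    return total / len(samples)


def expectile(samples, tau, iters=100):
    """Find the tau-expectile by bisection on the (monotone) loss gradient."""
    lo, hi = min(samples), max(samples)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # Gradient of the loss at mid, up to a positive constant factor.
        grad = sum(((1.0 - tau) if mid > x else tau) * (mid - x) for x in samples)
        if grad > 0:
            hi = mid  # minimizer lies to the left
        else:
            lo = mid  # minimizer lies to the right
    return 0.5 * (lo + hi)


# Hypothetical return samples with one optimistic outlier.
samples = [0.0, 1.0, 2.0, 10.0]
mean_estimate = expectile(samples, tau=0.5)   # tau = 0.5 recovers the mean
lower_estimate = expectile(samples, tau=0.1)  # tau < 0.5 is more conservative
```

With $\tau = 0.5$ the estimate coincides with the sample mean (3.25 here), while $\tau = 0.1$ gives roughly 1.05: the single high sample has much less influence, which is the conservative behavior the abstract refers to.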

Paper Structure (48 sections, 3 theorems, 16 equations, 8 figures, 23 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Problem setup.
Model-based offline RL.
Expectile regression.
Approach
Lower expectile Q-learning
Lower expectile Q-learning with $\lambda$-return
Lower expectile policy learning with $\lambda$-return
Expanding dataset with model rollouts
Experiments
Tasks
Compared offline RL algorithms
Model-free offline RL.
...and 33 more sections

Key Result

Lemma 1

Let $X$ be a distribution and $Y = E^{\tau}[X]$ be a lower expectile of $X$ (i.e. $0 < \tau \leq 0.5$). Let $\hat{Y}$ be an arbitrary optimistic estimate of $Y$ (i.e., $\hat{Y} \geq Y$), and define $W^{\tau}(\cdot) = |\tau - \mathbbm{1}(\cdot)|$. If we let $\hat{Y}_{\text{new}} = \frac{\mathbb{E}[W^

Figures (8)

Figure 1: Lower Expectile Q-learning (LEQ). (left) In model-based offline RL, an agent can generate imaginary trajectories using a world model. (right) For conservative Q-evaluation of the policy, LEQ learns the lower expectile of the target $Q$-distribution from a few sampled rollouts $\mathcal{T}_i$, without estimating the entire Q-distribution with exhaustive rollouts.
Figure 2: Comparison of standard Q-learning and Lower Expectile Q-learning (LEQ). LEQ generalizes standard Q-learning (with $\lambda$-returns $Q_t^{\lambda}(\mathcal{T})$) by multiplying the Q-learning objective by a simple asymmetric weight "$\lvert \tau - \mathbbm{1}(Q_{\phi}(s_t, a_t) > Q_t^{\lambda}(\mathcal{T})) \rvert$", so that with $\tau < 0.5$ overestimates of the target are penalized more heavily than underestimates. $\mathcal{T} = (\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \mathbf{a}_1, r_1, \cdots, \mathbf{s}_T)$ is a model-generated trajectory and $\tau \leq 0.5$ is the expectile hyperparameter controlling the degree of conservatism. When $\tau = 0.5$, the weight is constant and LEQ reduces to standard Q-learning.
Figure 3: AntMaze tasks.
Figure 4: Locomotion tasks.
Figure 5: Failure in medium mazes. The agent plans impossible trajectories on certain states (red circles).
...and 3 more figures
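The critic objective described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the $\lambda$-return uses the standard TD($\lambda$) backward recursion, which may differ in detail from the paper's definition, and the indicator is taken to fire when the current estimate exceeds the target so that $\tau < 0.5$ is conservative. The function names, the default `gamma` and `lam` values, and the toy inputs are hypothetical.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95):
    """TD(lambda) targets along one model-generated trajectory.

    rewards[t] is r_t; next_values[t] is the critic's estimate at step t + 1.
    Assumes the standard recursion
    Q_t = r_t + gamma * ((1 - lam) * V_{t+1} + lam * Q_{t+1}).
    """
    targets = [0.0] * len(rewards)
    running = next_values[-1]  # bootstrap from the final step
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * running)
        targets[t] = running
    return targets


def leq_critic_loss(q_values, targets, tau=0.1):
    """Squared error weighted by |tau - 1(q > target)|.

    With tau < 0.5, overestimates (q > target) get the larger weight 1 - tau.
    """
    total = 0.0
    for q, y in zip(q_values, targets):
        weight = abs(tau - (1.0 if q > y else 0.0))
        total += weight * (q - y) ** 2
    return total / len(q_values)


# Toy two-step rollout with zero bootstrap values and no discounting.
targets = lambda_returns([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0)
over = leq_critic_loss([1.0], [0.0], tau=0.1)   # estimate above target
under = leq_critic_loss([0.0], [1.0], tau=0.1)  # estimate below target
```

In this toy case an overestimate of size 1 costs nine times as much as an underestimate of the same size, and setting `tau=0.5` makes the two costs equal, matching the caption's remark that LEQ then reduces to standard Q-learning (up to a constant factor).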

Theorems & Definitions (6)

Lemma 1
proof
Lemma 2
proof
Theorem 1
proof
