Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Haoran Xu; Li Jiang; Jianxiong Li; Zhuoran Yang; Zhaoran Wang; Victor Wai Kin Chan; Xianyuan Zhan

TL;DR

The paper introduces Implicit Value Regularization (IVR), a framework to address distributional shift in offline RL by enforcing in-sample learning through behavior-regularized backups. Building on IVR, the authors derive two practical algorithms, Sparse Q-Learning (SQL) and Exponential Q-Learning (EQL), which impose value regularization to induce sparsity or exponential weighting in the learned value function. The methods learn Q and V entirely from in-sample data and avoid querying unseen actions, achieving state-of-the-art results on challenging D4RL tasks such as AntMaze and Kitchen, and exhibiting robustness in noisy and small-data regimes. The work connects to CQL and IQL, providing a unified lens for behavior-regularized offline RL and offering a path toward more stable, data-efficient offline-to-online extensions.

Abstract

Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy as computing $Q$-values using out-of-distribution (OOD) actions will suffer from errors due to distributional shift. The recently proposed \textit{In-sample Learning} paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the \textit{Implicit Value Regularization} (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works, but in a complete in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes.
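To make the in-sample idea concrete, here is a minimal sketch of fitting a state value using only Q-values of actions that appear in the dataset. It uses the expectile loss from IQL as a stand-in objective; the abstract does not spell out the loss, so that choice is an assumption, and the function names, toy Q-values, learning rate, and step count are all hypothetical.

```python
import numpy as np

def expectile_loss(residual, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 (IQL-style, assumed here)."""
    weight = np.where(residual < 0, 1.0 - tau, tau)
    return weight * residual ** 2

def fit_in_sample_value(q_in_sample, tau, lr=0.1, steps=2000):
    """Fit a scalar V(s) by gradient descent on the mean expectile loss.

    Only Q-values of dataset actions are used; no unseen action is queried.
    """
    v = float(np.mean(q_in_sample))
    for _ in range(steps):
        u = q_in_sample - v
        weight = np.where(u < 0, 1.0 - tau, tau)
        grad = np.mean(-2.0 * weight * u)  # derivative of the mean loss w.r.t. v
        v -= lr * grad
    return v

# Hypothetical Q-values of the actions observed at one state.
q_data = np.array([0.0, 1.0, 2.0, 10.0])
v_mean = fit_in_sample_value(q_data, tau=0.5)  # symmetric loss recovers the mean
v_high = fit_in_sample_value(q_data, tau=0.9)  # leans toward the best in-sample action
```

With a symmetric loss the fitted value is the average over dataset actions; raising `tau` pulls it toward the largest in-sample Q-value while never exceeding it, which is the sense in which the policy can improve without evaluating out-of-distribution actions.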

Paper Structure (22 sections, 4 theorems, 44 equations, 9 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Offline RL with Implicit Value Regularization
Behavior-regularized MDPs
Assumptions and Solutions
Sparse $Q$-Learning (SQL)
Exponential $Q$-Learning (EQL)
Discussions
Experiments
Benchmark Datasets
Noisy Data Regime
Small Data Regime
Conclusions and Future Work
A Statistical View of Why SQL and EQL Work
...and 7 more sections

Key Result

Theorem 1

In the behavior-regularized MDP, any optimal policy $\pi^*$ and its optimal value functions $Q^*$ and $V^*$ satisfy an optimality condition that holds for all states and actions. In that condition, $U^*(s)$ is a normalization term chosen so that $\sum_{a \in \mathcal{A}} \pi^*(a | s)=1$.

Figures (9)

Figure 1: Performance of different methods in noisy data regimes.
Figure 2: Left: The loss as a function of the residual ($Q-V$) in the learning objective of $V$ in SQL, for different values of $\alpha$. Center: An example of estimating the state-conditional extrema of a two-dimensional random variable (generated by adding random noise to samples from $y=\sin(x)$). Each $x$ corresponds to a distribution over $y$. The fit moves closer to the extrema as $\alpha$ becomes smaller. Right: Comparison of the loss derivatives of SQL and IQL. In SQL, the derivative stays constant when the residual is below a threshold.
Figure 3: Left: The loss as a function of the residual ($Q-V$) in the learning objective of $V$ in EQL, for different values of $\alpha$. Center: An example of estimating the state-conditional extrema of a two-dimensional random variable (generated by adding random noise to samples from $y=\sin(x)$). Each $x$ corresponds to a distribution over $y$. The fit moves closer to the extrema as $\alpha$ becomes smaller. Right: Comparison of the loss derivatives of EQL and IQL. In EQL, the derivative changes smoothly and stays (nearly) constant when the residual is below a threshold.
Figure 4: Evaluation of IQL and SQL on the Four Rooms environment. SQL learns a value function closer to the optimal one and produces a better policy than IQL when the dataset is heavily corrupted by suboptimal actions.
Figure 5: Learning curves of SQL and EQL on D4RL MuJoCo locomotion datasets.
...and 4 more figures
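
The derivative behavior described in the captions of Figures 2 and 3 can be illustrated with a short sketch. The loss forms below are an assumption: they are recalled from the SQL and EQL value objectives rather than stated on this page, and every function name here is hypothetical. The residual is $u = Q - V$ and gradients are taken with respect to $V$.

```python
import numpy as np

def sql_value_loss(q, v, alpha):
    """Assumed SQL-style loss: max(1 + u / (2 * alpha), 0)^2 + v / alpha."""
    z = np.maximum(1.0 + (q - v) / (2.0 * alpha), 0.0)
    return z ** 2 + v / alpha

def sql_value_grad(q, v, alpha):
    """Derivative of the assumed SQL loss w.r.t. v."""
    z = np.maximum(1.0 + (q - v) / (2.0 * alpha), 0.0)
    return (1.0 - z) / alpha  # equals 1 / alpha once u <= -2 * alpha

def eql_value_loss(q, v, alpha):
    """Assumed EQL-style loss: exp(u / alpha) + v / alpha."""
    return np.exp((q - v) / alpha) + v / alpha

def eql_value_grad(q, v, alpha):
    """Derivative of the assumed EQL loss w.r.t. v."""
    return (1.0 - np.exp((q - v) / alpha)) / alpha  # tends to 1 / alpha as u -> -inf

# With v = 0 the residual u equals q.
residuals = np.array([-10.0, -2.0, 0.0])
sql_grads = sql_value_grad(residuals, 0.0, alpha=1.0)
eql_grads = eql_value_grad(residuals, 0.0, alpha=1.0)
```

Under these assumed forms, the SQL gradient is exactly flat for residuals at or below $-2\alpha$, so samples with very low $Q$ all contribute the same constant push on $V$, while the EQL gradient approaches the same constant smoothly.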

Theorems & Definitions (6)

Theorem 1
Theorem 2
Lemma 1
Lemma 2
proof
proof
