Projection Implicit Q-Learning with Support Constraint for Offline Reinforcement Learning

Xinchen Han; Hossam Afifi; Michel Marot

TL;DR

Proj-IQL tackles extrapolation error in offline RL by replacing IQL's fixed expectile parameter with a projection-based adaptive parameter $\tau_{\text{proj}}(a|s)$, and by coupling multi-step, in-sample expectile learning with a relaxed, support-constrained policy improvement step. Theoretical results establish monotonic policy improvement when $\tau_{\text{proj}}$ is nondecreasing, along with a progressively more rigorous criterion for identifying superior actions. The practical implementation uses clipping, batch averaging, and self-normalized importance sampling (SNIS) to stabilize training. Empirically, Proj-IQL achieves state-of-the-art performance on D4RL benchmarks, notably on the AntMaze-v0 and Kitchen-v0 tasks, which require strong stitching capabilities.
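The summary names SNIS without saying which quantities are reweighted, so the following is only a generic sketch of self-normalized importance sampling, not the paper's implementation. The function name `snis_estimate` and its arguments are hypothetical; the only fact relied on is the standard estimator $\sum_i w_i f_i / \sum_i w_i$.

```python
import math


def snis_estimate(values, log_weights):
    """Generic self-normalised importance sampling estimate.

    Returns sum(w_i * f_i) / sum(w_i), where w_i = exp(log_weights[i]).
    Hypothetical helper: how Proj-IQL forms its weights is not stated here.
    """
    # Subtract the max log-weight before exponentiating for numerical stability;
    # the shift cancels in the ratio, so the estimate is unchanged.
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, values)) / total
```

Because the weights are divided by their own sum, the estimate is invariant to any constant factor in the importance ratios, which is one common reason to prefer it when the ratios are only known up to normalization.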

Abstract

Offline Reinforcement Learning (RL) faces a critical challenge: extrapolation errors caused by out-of-distribution (OOD) actions. The Implicit Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample learning, effectively mitigating the risks associated with OOD actions. However, the fixed hyperparameter in policy evaluation and the density-based policy improvement method limit its overall efficiency. In this paper, we propose Proj-IQL, a projective IQL algorithm enhanced with a support constraint. In the policy evaluation phase, Proj-IQL generalizes the one-step approach to a multi-step approach through vector projection, while maintaining the in-sample learning and expectile regression framework. In the policy improvement phase, Proj-IQL introduces a support constraint that is better aligned with the policy evaluation approach. Furthermore, we theoretically demonstrate that Proj-IQL guarantees monotonic policy improvement and enjoys a progressively more rigorous criterion for superior actions. Empirical results demonstrate that Proj-IQL achieves state-of-the-art performance on D4RL benchmarks, especially in challenging navigation domains.
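To make the role of the expectile hyperparameter concrete, here is a minimal sketch of expectile regression on a scalar, using the standard asymmetric squared loss $L_2^{\tau}(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$ from IQL. The function names, the plain gradient-descent fit, and the step count and learning rate are illustrative assumptions, not the authors' code; the sketch also uses a fixed $\tau$, not the adaptive $\tau_{\text{proj}}(a|s)$.

```python
def expectile_loss(u, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u**2 for a residual u."""
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u


def fit_expectile(samples, tau, steps=2000, lr=0.05):
    """Fit a scalar v minimising the mean expectile loss of (x - v).

    Illustrative gradient descent; steps and lr are assumed values.
    """
    v = sum(samples) / len(samples)  # start from the mean (the tau = 0.5 solution)
    for _ in range(steps):
        grad = 0.0
        for x in samples:
            u = x - v
            weight = (1.0 - tau) if u < 0 else tau
            grad += -2.0 * weight * u  # derivative of the loss w.r.t. v
        v -= lr * grad / len(samples)
    return v
```

With $\tau = 0.5$ the fit recovers the sample mean, and raising $\tau$ moves the estimate toward the largest sample. This is why a single fixed $\tau$ trades off conservatism against optimism uniformly across all states and actions.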

Paper Structure (17 sections, 7 theorems, 53 equations, 3 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Offline RL
Support Constraint Policy Improvement
IQL
Projection IQL with Support Constraint
Policy Evaluation in Proj-IQL
Policy Improvement in Proj-IQL
Practical Implementation of Proj-IQL
Experiments
Comparisons on D4RL Benchmarks
The Training Curves of $\tau_\text{proj}$
Empirical Study on the Projection Parameter
Conclusion
...and 2 more sections

Key Result

Lemma 4.1

For all $s$, $\tau_1$ and $\tau_2$ such that $\tau_1 < \tau_2$ we get

Figures (3)

Figure 1: The performance of IQL with $\tau = 0.3, 0.6, 0.9$ on 10 D4RL datasets: Walker2d, HalfCheetah, and Hopper (each in medium-v2 and medium-expert-v2), plus AntMaze-umaze-v0, AntMaze-umaze-diverse-v0, AntMaze-medium-play-v0, and AntMaze-medium-diverse-v0.
Figure 2: The training curves of normalized score and $\tau_{\text{proj}}(a|s)$ on AntMaze-v0 and Kitchen-v0 datasets. The solid line and shaded regions represent the mean and standard deviation, respectively.
Figure 3: The training curves of normalized score and $\tau_{\text{proj}}(a|s)$ on Kitchen-v0 datasets with batch sizes of 16, 64, 128, and 256. The solid line and shaded regions represent the mean and standard deviation, respectively.

Theorems & Definitions (14)

Lemma 4.1
Theorem 4.2
Lemma 4.3
Theorem 4.4
Theorem 4.5
Lemma 4.6
Theorem 4.7
proof
proof
proof
...and 4 more
