Q-value Regularized Transformer for Offline Reinforcement Learning

Shengchao Hu; Ziqing Fan; Chaoqin Huang; Li Shen; Ya Zhang; Yanfeng Wang; Dacheng Tao

TL;DR

This work addresses the stitching challenge that Conditional Sequence Modeling (CSM) faces in offline reinforcement learning by introducing the Q-value Regularized Transformer (QT). QT integrates a Conditional Transformer Policy with a learnable Q-value module and trains them with a joint loss that combines trajectory-based regularization with policy improvement from Q-values, which are estimated with an $n$-step Bellman estimate. On the D4RL suite, QT is reported to achieve state-of-the-art performance across multiple domains, and ablations confirm the critical role of the Q-value module and the benefits of Q-value-guided inference. The approach offers a path to better stitching of sub-optimal trajectories into high-return behaviors, particularly in long-horizon and sparse-reward offline settings.
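To make the $n$-step Bellman estimate mentioned above concrete, here is a minimal sketch of how such a target could be computed for one trajectory segment. The function name `n_step_target`, its arguments, and the discount value are illustrative assumptions, not the paper's implementation; in practice the bootstrap value would typically come from a target Q-network.

```python
import numpy as np

def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """Compute sum_{i<n} gamma^i * r_i + gamma^n * bootstrap_q.

    rewards: the n rewards observed after the current state-action pair.
    bootstrap_q: a Q-value estimate at the state reached after n steps
                 (hypothetical input; assumed to come from a target network).
    """
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.shape[0]
    # Discount factors gamma^0, gamma^1, ..., gamma^(n-1).
    discounts = gamma ** np.arange(n)
    return float(discounts @ rewards + gamma ** n * bootstrap_q)

# Usage: three observed rewards, then bootstrap from an estimated Q-value.
target = n_step_target([1.0, 0.0, 2.0], bootstrap_q=10.0, gamma=0.5)
```

Using several real rewards before bootstrapping is a common way to propagate sparse or delayed rewards faster than a one-step target would.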

Abstract

Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Conditional Sequence Modeling (CSM), a paradigm that learns the action distribution for each state conditioned on the trajectory history and target returns. However, these methods often struggle to stitch together optimal trajectories from sub-optimal ones because the returns sampled within individual trajectories are inconsistent with the optimal returns across multiple trajectories. Dynamic Programming (DP) methods offer a solution by leveraging a value function to approximate optimal future returns for each state, but they are prone to unstable learning, particularly in long-horizon and sparse-reward scenarios. Building upon these insights, we propose the Q-value regularized Transformer (QT), which combines the trajectory modeling ability of the Transformer with the predictability of optimal future returns from DP methods. QT learns an action-value function and integrates a term maximizing action-values into the training loss of CSM, seeking optimal actions that remain closely aligned with the behavior policy. Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods, highlighting the potential of QT to enhance the state-of-the-art in offline RL.
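The idea of adding a term that maximizes action-values to the CSM training loss can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `q_regularized_loss`, the mean-squared-error imitation term, and the weight `eta` are hypothetical choices, and the Q-values are passed in as plain numbers rather than produced by a learned network.

```python
import numpy as np

def q_regularized_loss(pred_actions, data_actions, q_values, eta=1.0):
    """Imitation term minus a weighted Q-value term (a quantity to minimize).

    pred_actions, data_actions: arrays of shape (batch, action_dim).
    q_values: Q(s, predicted action) per batch item, shape (batch,).
    eta: hypothetical weight trading off imitation against Q maximization.
    """
    pred_actions = np.asarray(pred_actions, dtype=float)
    data_actions = np.asarray(data_actions, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    # Staying close to dataset actions keeps the policy near the behavior policy.
    bc_term = np.mean((pred_actions - data_actions) ** 2)
    # Subtracting the mean Q-value rewards actions the critic rates highly.
    return float(bc_term - eta * np.mean(q_values))

# Usage with a toy batch of two 2-D actions.
loss = q_regularized_loss(
    pred_actions=[[0.0, 0.0], [1.0, 1.0]],
    data_actions=[[0.0, 0.0], [0.0, 0.0]],
    q_values=[1.0, 3.0],
    eta=0.1,
)
```

With `eta = 0` the sketch reduces to pure imitation of the dataset actions; larger values push the predicted actions toward those with higher estimated value.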

Paper Structure (25 sections, 5 theorems, 15 equations, 2 figures, 11 tables)

Introduction
Preliminary
Offline Reinforcement Learning
Rethinking Stitching in CSM
Methodology
Conditional Transformer Policy
Training with Q-value Regularization
Inference with Q-value Module
Experiment
Main Results
Ablation Study
Related Work
Conclusion
Proofs
Proof of Theorem \ref{thm:onlydt}
...and 10 more sections

Key Result

Theorem 3.1

Consider an MDP, behavior policy $\beta$, and decision transformer $\pi$ with condition function $f$. Assume the $\epsilon$-near determinism of the MDP, where $P(r \neq \mathcal{R}(\mathbf{s},\mathbf{a}) \text{ or } \mathbf{s}' \neq \mathcal{T}(\mathbf{s},\mathbf{a}) \mid \mathbf{s},\mathbf{a})$ [...] where $\mathcal{H}$ is the horizon of the MDP.

Figures (2)

Figure 1: Evaluation results for CQL, DT, QDT, and QT in the Maze2D tasks (a) and MuJoCo Gym delayed reward (medium) tasks (b). The results show that DT fails to effectively stitch trajectories and CQL under-performs in sparse reward scenarios (delayed reward). QDT yields consistent yet intermediate results across all environments, while QT consistently secures the top performance across all tested environments, showcasing its superiority.
Figure 2: Ablation on long-horizon ability: performance comparison across input sequence horizons $K \in [10, 80]$ in the walker2d-medium-replay-v2 task.

Theorems & Definitions (7)

Theorem 3.1
Theorem 3.2
Lemma 1.1
proof
Lemma 1.2
Lemma 1.3
proof
