Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning

Jeonghye Kim, Suyoung Lee, Woojun Kim, Youngchul Sung

TL;DR

This work introduces $Q$-Aided Conditional Supervised Learning (QCS), which combines the stability of RCSL with the stitching capability of $Q$-functions by adaptively integrating $Q$-aid into RCSL's loss function based on trajectory return.

Abstract

Offline reinforcement learning (RL) has progressed with return-conditioned supervised learning (RCSL), but its lack of stitching ability remains a limitation. We introduce $Q$-Aided Conditional Supervised Learning (QCS), which effectively combines the stability of RCSL with the stitching capability of $Q$-functions. By analyzing $Q$-function over-generalization, which impairs stable stitching, QCS adaptively integrates $Q$-aid into RCSL's loss function based on trajectory return. Empirical results show that QCS significantly outperforms RCSL and value-based methods, consistently achieving or exceeding the maximum trajectory returns across diverse offline RL benchmarks.
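
To make "adaptively integrates $Q$-aid into RCSL's loss function based on trajectory return" concrete, the following is a minimal sketch in Python. It assumes a squared-error RCSL term and a $Q$-aid weight that shrinks linearly to zero as the trajectory return approaches the best return in the dataset. The weight form, the function names (`q_aid_weight`, `q_aided_loss`), and the scale `lam` are illustrative assumptions and are not specified in the abstract.

```python
import numpy as np

def q_aid_weight(traj_return, best_return, lam=1.0):
    """Hypothetical weight: zero for the best trajectory, larger for worse ones."""
    return lam * max(best_return - traj_return, 0.0)

def q_aided_loss(pred_actions, data_actions, q_of_pred,
                 traj_return, best_return, lam=1.0):
    """RCSL term minus a return-dependent Q-aid term (assumed form)."""
    pred_actions = np.asarray(pred_actions, dtype=float)
    data_actions = np.asarray(data_actions, dtype=float)
    # RCSL part: squared error between predicted and dataset actions.
    rcsl = np.mean(np.sum((pred_actions - data_actions) ** 2, axis=-1))
    # Q-aid part: mean Q-value of the predicted actions; higher Q lowers the loss.
    q_aid = np.mean(np.asarray(q_of_pred, dtype=float))
    w = q_aid_weight(traj_return, best_return, lam)
    return rcsl - w * q_aid
```

Under this assumed weighting, a sample from the highest-return trajectory is trained with the RCSL term alone, while samples from lower-return trajectories are pulled increasingly toward actions with high $Q$-values, matching the intuition stated in Figure 1.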

Paper Structure (41 sections, 10 equations, 12 figures, 18 tables, 1 algorithm)

Figures (12)

  • Figure 1: Conceptual idea of QCS: Follow RCSL when learning from optimal trajectories where it predicts actions confidently but the $Q$-function may stitch incorrectly. Conversely, refer to the $Q$-function when learning from sub-optimal trajectories where RCSL is less certain but the $Q$-function is likely accurate.
  • Figure 2: Mean normalized return in MuJoCo medium, medium-replay, medium-expert, and AntMaze large. The scores of RCSL, the value-based methods, and the combined methods are the maximum mean performance within each group. The full scores are reported in the paper's overall-performance section.
  • Figure 3: An example demonstrating the limit of RCSL: The dataset consists of two trajectories, with a time limit of $T=3$ and a discount factor $\gamma=1$. The black dashed arrow represents the optimal policy yielding a maximum return of 7.
  • Figure 4: (a) the view of the environment and true $Q$ calculated through value iteration, (b) training datasets with color representing the true $Q$ for each sample, (c) $Q_\theta$ learned through regression with a medium dataset (upper) and an expert dataset (bottom), (d) $Q_\theta$ learned through IQL with a medium dataset (upper) and an expert dataset (bottom).
  • Figure 5: Estimated $Q_{\theta}(s,\bar{a})$ for $\bar{a}\in\mathcal{A}$ and the normalized NTK $k_{\theta}(s,\bar{a},s,a_{\text{ref}})/\lVert \nabla_{\theta}Q_{\theta}(s,a_{\text{ref}}) \rVert_{2}^{2}$ across four datasets, with a 1D action space for Inverted Double Pendulum and a 3D action space for Hopper. In these figures, we fix the state $s$, set the reference action $a_{\text{ref}}$ to zero (marked as $\bigstar$), and sweep over all actions $\bar{a}\in\mathcal{A}$. For Hopper, the axes of the 3D plots are the action dimensions and color represents the $Q$-value. The NTK plots show only the high-NTK regions, where the normalized value exceeds 0.9. Refer to the paper's appendix on further NTK analysis for details.
  • ...and 7 more figures
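
The normalized NTK quantity in the Figure 5 caption can be illustrated with a small sketch. It assumes the standard empirical NTK, $k_{\theta}(s,\bar{a},s,a_{\text{ref}}) = \nabla_{\theta}Q_{\theta}(s,\bar{a})^{\top}\nabla_{\theta}Q_{\theta}(s,a_{\text{ref}})$, and a hypothetical one-hidden-layer tanh network with random weights standing in for a trained $Q_{\theta}$. The network, its size, and the helper names (`init_params`, `q_grad`, `normalized_ntk`) are assumptions for illustration only.

```python
import numpy as np

def init_params(in_dim, hidden=16, seed=0):
    """Random parameters for a toy Q-network (stand-in for a trained one)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.5, size=(hidden, in_dim)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.5, size=hidden),
        "b2": 0.0,
    }

def q_grad(params, s, a):
    """Flattened gradient of Q(s, a) = w2 . tanh(W1 x + b1) + b2 w.r.t. all parameters."""
    x = np.concatenate([s, a])
    h = np.tanh(params["W1"] @ x + params["b1"])
    dpre = params["w2"] * (1.0 - h ** 2)  # dQ/d(pre-activation)
    return np.concatenate([
        np.outer(dpre, x).ravel(),  # dQ/dW1
        dpre,                       # dQ/db1
        h,                          # dQ/dw2
        [1.0],                      # dQ/db2
    ])

def normalized_ntk(params, s, a_bar, a_ref):
    """k(s, a_bar, s, a_ref) divided by the squared gradient norm at a_ref."""
    g_bar = q_grad(params, s, a_bar)
    g_ref = q_grad(params, s, a_ref)
    return float(g_bar @ g_ref / (g_ref @ g_ref))

# Sweep a 1D action while holding the state and the reference action fixed.
params = init_params(in_dim=3)
s = np.array([0.1, -0.2])
a_ref = np.array([0.0])
actions = np.linspace(-1.0, 1.0, 5)
sweep = [normalized_ntk(params, s, np.array([a]), a_ref) for a in actions]
```

By construction the normalized value equals 1 at $\bar{a} = a_{\text{ref}}$; values that stay close to 1 far from the reference action indicate that a gradient update at $a_{\text{ref}}$ changes $Q_{\theta}$ at those distant actions by a similar amount, which is the over-generalization the caption visualizes.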