
Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

TL;DR

Simply adding TD3 gradients to the finetuning process of ODT effectively improves its online finetuning performance, especially if ODT is pretrained with low-reward offline data.

Abstract

Decision Transformers have recently emerged as a new and compelling paradigm for offline Reinforcement Learning (RL), completing a trajectory in an autoregressive way. While improvements have been made to overcome initial shortcomings, online finetuning of decision transformers has been surprisingly under-explored. The widely adopted state-of-the-art Online Decision Transformer (ODT) still struggles when pretrained with low-reward offline data. In this paper, we theoretically analyze the online finetuning of the decision transformer, showing that a commonly used Return-To-Go (RTG) that is far from the expected return hampers the online finetuning process. This problem, however, is well addressed by the value function and advantage of standard RL algorithms. As suggested by our analysis, we find in our experiments that simply adding TD3 gradients to the finetuning process of ODT effectively improves its online finetuning performance, especially if ODT is pretrained with low-reward offline data. These findings provide new directions to further improve decision transformers.
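
One way to read "adding TD3 gradients" is as a weighted combination of the ODT supervised loss and a TD3-style actor term. The display below is a sketch under that assumption, not the paper's stated objective: $\mathcal{L}_{\text{ODT}}$, $Q_\phi$, $\pi_\theta$, and the data distribution $\mathcal{D}$ are notation introduced here for illustration, and $\alpha$ denotes the RL coefficient.

$$
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{ODT}}(\theta) \;-\; \alpha \, \mathbb{E}_{(s,\,\text{RTG}) \sim \mathcal{D}} \left[ Q_\phi\big(s, \pi_\theta(s, \text{RTG})\big) \right]
$$

Minimizing the first term keeps the policy close to the RTG-conditioned behavior in the data, while the second term pushes actions in the direction that increases the critic's value estimate. With $\alpha = 0$ this reduces to plain ODT finetuning.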

Paper Structure

This paper contains 42 sections, 7 theorems, 15 equations, 26 figures, 11 tables.

Key Result

Lemma 1

(Informal) Assume rewards $r(s,a)$ are bounded in $[0, R_{\text{max}}]$ (we use "max" instead of "$\beta$max" because this is a property of the environment and not the dataset) and $\text{RTG}_{\text{eval}}\geq \text{RTG}_{\beta\text{max}}$. Then with probability at least $1-\delta$, we have the prob... where $\delta$ depends on the number of trajectories in the dataset and the prior distribution (see App...

Figures (26)

  • Figure 1: An overview of our work, illustrating why ODT fails to improve with low-return offline data and why RL gradients such as those from TD3 could help. The decision transformer yields the gradient $\frac{\partial a}{\partial \text{RTG}}$, but local policy improvement requires the opposite, i.e., $\frac{\partial \text{RTG}}{\partial a}$. Therefore, the agent cannot recover if the current policy, conditioned on a high target RTG, does not actually lead to a high real RTG, which is very likely when the target RTG is out-of-distribution and too far from the pretrained policy. By adding RL gradients with a small coefficient, the agent can improve locally, which leads to better performance.
  • Figure 2: An illustration of a simple MDP, showing how RL can infer the direction for improvement while online DT fails. Panels (a) and (b) show that DDPG and ODT+DDPG manage to maximize reward and find the correct optimal action quickly, while ODT fails to do so. Panel (c) shows how a DDPG/ODT+DDPG critic (from light blue/orange to dark blue/red) manages to fit the ground-truth reward (green curve). Panel (d) shows that the ODT policy (changing from light gray to dark) fails to discover the hidden reward peak near $0$ between the two low-reward areas (near $-1$ and $1$ respectively) contained in the offline data. Meanwhile, ODT+DDPG succeeds in finding the reward peak.
  • Figure 3: Results on Adroit DAPG environments. The proposed method, TD3+ODT, improves upon baselines. Note that TD3, IQL, and TD3+ODT all perform decently at the beginning of online finetuning, but TD3 fails while TD3+ODT improves much more than IQL during online finetuning.
  • Figure 4: Reward curves for each method in Antmaze environments. IQL works best on the large maze, while our proposed method works best on the medium maze and umaze. DDPG+ODT performs worse than our method and IQL but much better than the remaining baselines, which again supports our motivation that adding RL gradients to ODT is helpful.
  • Figure 5: Panel (a) shows ablations on the RL coefficient $\alpha$. While a higher $\alpha$ aids exploration, as shown in the halfcheetah-medium-replay-v2 case, it may sometimes introduce instability, as shown in the hammer-human-v1 case. Panel (b) shows ablations on $T_{\text{eval}}$, which balances training stability against having more information for decision-making.
  • ...and 21 more figures
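
The point made in the captions of Figures 1 and 2, that local improvement needs the gradient of the return with respect to the action, can be sketched with a toy one-step problem. Everything below is an illustrative assumption rather than the paper's setup: the quadratic `reward`, the finite-difference `critic_grad` (standing in for a learned critic's $\frac{\partial Q}{\partial a}$, assumed to have already fit the reward), and the helper `improve_action` are hypothetical names chosen here.

```python
# Toy one-step problem (hypothetical, not the paper's exact MDP):
# actions live in [-1, 1] and the reward has a single peak at a = 0,
# while the starting actions sit in the low-reward regions near -1 and 1.


def reward(a: float) -> float:
    """Assumed ground-truth reward with its maximum at a = 0."""
    return 1.0 - a * a


def critic_grad(a: float, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of d reward / d a.

    Stands in for a learned critic's dQ/da; we assume the critic
    has already fit the reward closely.
    """
    return (reward(a + eps) - reward(a - eps)) / (2.0 * eps)


def improve_action(a0: float, lr: float = 0.1, steps: int = 100) -> float:
    """Deterministic-policy-gradient-style ascent on the action itself."""
    a = a0
    for _ in range(steps):
        a += lr * critic_grad(a)      # move along d(return)/d(action)
        a = max(-1.0, min(1.0, a))    # keep the action within bounds
    return a


final_from_left = improve_action(-1.0)
final_from_right = improve_action(1.0)
```

Starting from either low-reward region, the action moves toward the peak because each update follows the critic's gradient with respect to the action. A purely RTG-conditioned policy has no such signal: asking for a higher target return does not by itself indicate which way to move the action.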

Theorems & Definitions (16)

  • Lemma 1
  • Corollary 1
  • Lemma 2
  • proof
  • Theorem E.1
  • proof
  • Definition E.2
  • Remark E.4
  • Lemma 3
  • proof
  • ...and 6 more