Adjusting the Output of Decision Transformer with Action Gradient

Rui Lin, Yiwen Zhang, Zhicheng Peng, Minghao Lyu

TL;DR

This paper tackles extrapolation in offline RL by combining Decision Transformers with an Action Gradient (AG) mechanism. AG refines actions at evaluation time using the gradient of the Q-value with respect to the action, which complements token-prediction strategies and avoids unstable training. Using a critic trained with expectile regression, the method improves state-level extrapolation and performs strongly on the D4RL Gym and Maze2d benchmarks, sometimes reaching state-of-the-art levels. The work also analyzes the limitations of policy-gradient-based augmentations and argues that AG is a robust, compatible alternative with favorable hyperparameter properties for DT-based offline RL.
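To make the "critic trained with expectile regression" concrete, the following is a minimal sketch of the standard asymmetric squared loss that expectile regression uses. The function name `expectile_loss`, the default `tau=0.7`, and the convention that `diff` is a target minus a prediction are assumptions made for illustration; the summary above does not state the paper's exact loss or hyperparameters.

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, averaged.

    diff: residuals, assumed here to be (target - prediction).
    tau:  expectile level in (0, 1); 0.7 is an illustrative value only.
    """
    diff = np.asarray(diff, dtype=float)
    # Positive residuals are weighted by tau, negative ones by 1 - tau.
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

# With tau > 0.5, under-prediction (diff > 0) is penalized more heavily.
loss_under = expectile_loss([1.0], tau=0.7)
loss_over = expectile_loss([-1.0], tau=0.7)
```

With `tau = 0.5` the loss reduces to half the mean squared error, so the asymmetry parameter is what pushes the fitted value toward an upper expectile of the targets.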

Abstract

Decision Transformer (DT), which integrates reinforcement learning (RL) with the transformer model, introduces a novel approach to offline RL. Unlike classical algorithms that take maximizing the cumulative discounted reward as their objective, DT instead maximizes the likelihood of actions. This paradigm shift, however, presents two key challenges: trajectory stitching and action extrapolation. Existing methods, such as substituting specific tokens with predicted values and integrating the Policy Gradient (PG) method, address these challenges individually but, when combined, fail to improve performance stably due to inherent instability. To address this, we propose Action Gradient (AG), a method that directly adjusts actions to fulfill a function analogous to that of PG while also integrating efficiently with token prediction techniques. AG uses the gradient of the Q-value with respect to the action to optimize the action. The empirical results demonstrate that our method can significantly enhance the performance of DT-based algorithms, with some results achieving state-of-the-art levels.
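The core idea, adjusting an action along the gradient of the Q-value with respect to that action, can be sketched as a few steps of gradient ascent. This is a hedged illustration, not the authors' implementation: the names `action_gradient`, `numerical_grad`, and `toy_q` are hypothetical, the critic is a toy quadratic rather than a learned network, the gradient is approximated by finite differences, and clipping to `[low, high]` is an added assumption. The step size `eta` and iteration count `n` mirror the coefficient $\eta$ and number of iterations $n$ named in the ablation study.

```python
import numpy as np

def numerical_grad(q_fn, state, action, eps=1e-5):
    """Central finite-difference estimate of dQ/da for a 1-D action."""
    grad = np.zeros_like(action, dtype=float)
    for i in range(action.size):
        step = np.zeros_like(action, dtype=float)
        step[i] = eps
        grad[i] = (q_fn(state, action + step) - q_fn(state, action - step)) / (2 * eps)
    return grad

def action_gradient(q_fn, state, action, eta=0.1, n=10, low=-1.0, high=1.0):
    """Refine an action with n gradient-ascent steps on Q(state, action).

    Clipping to [low, high] is an assumption made here to keep the action
    inside a bounded action space.
    """
    a = np.asarray(action, dtype=float).copy()
    for _ in range(n):
        a = a + eta * numerical_grad(q_fn, state, a)
        a = np.clip(a, low, high)
    return a

def toy_q(state, action):
    """Hypothetical critic: Q peaks at a fixed target action."""
    target = np.array([0.5, -0.2])
    return -np.sum((action - target) ** 2)

state = np.zeros(3)          # placeholder state; the toy critic ignores it
a0 = np.array([0.0, 0.0])    # stands in for the action proposed by DT
a_refined = action_gradient(toy_q, state, a0, eta=0.1, n=50)
```

Because the adjustment happens on the action itself at evaluation time, the policy's parameters are left untouched, which is what lets this kind of refinement sit alongside token prediction. With a learned critic, the finite-difference gradient would normally be replaced by automatic differentiation.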

Paper Structure

This paper contains 16 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Stitching trajectories by conditioning on the $RTG$ token.
  • Figure 2: The left graph shows the distribution of data within the dataset and the critic's outputs for various actions. The right graph shows the rewards obtained by different algorithms in this particular state (DT: Decision Transformer, PG: Policy Gradient, TP: Token Prediction, AG: Action Gradient). The state-level extrapolation ability of DT is limited, and token prediction does not effectively address this deficiency. Using a critic to compute gradients can substantially enhance this capability in ways that the alternative methods cannot.
  • Figure 3: The experimental results of the ablation study focusing on the coefficient $\eta$ and the number of iterations $n$.