ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Chen-Xiao Gao, Chenyang Wu, Mingjun Cao, Rui Kong, Zongzhang Zhang, Yang Yu

TL;DR

Evaluation results validate that, by leveraging dynamic programming, ACT achieves effective trajectory stitching and robust action generation despite environmental stochasticity, outperforming baseline methods across various benchmarks.

Abstract

Decision Transformer (DT), which employs expressive sequence modeling techniques to perform action generation, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which has known weaknesses, such as susceptibility to environmental stochasticity. To overcome DT's weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximated value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context with estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which are suitable for different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. Finally, during testing, ACT generates actions conditioned on a desired advantage. Our evaluation results validate that, by leveraging dynamic programming, ACT achieves effective trajectory stitching and robust action generation despite environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT's various design choices through ablation studies. Our code is available at https://github.com/LAMDA-RL/ACT.
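
The second step, advantage estimation, can be illustrated with a minimal sketch. It assumes that IAE denotes an immediate advantage of the form Q(s, a) - V(s) and that GAE denotes the usual generalized advantage estimator built from one-step TD residuals; the abstract does not spell out either definition. The function names, the `last_value` bootstrap argument, and the default `gamma` and `lam` values are hypothetical and are not taken from the ACT codebase.

```python
from typing import List, Sequence


def iae(q_values: Sequence[float], values: Sequence[float]) -> List[float]:
    """Immediate advantage (assumed form): A(s_t, a_t) = Q(s_t, a_t) - V(s_t)."""
    return [q - v for q, v in zip(q_values, values)]


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float = 0.0,  # V(s_T) used to bootstrap after the final step
    gamma: float = 0.99,      # discount factor (illustrative default)
    lam: float = 0.95,        # GAE mixing coefficient (illustrative default)
) -> List[float]:
    """Generalized advantage: sum over l of (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    # Sweep backwards so each step reuses the accumulated tail sum.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(iae([1.5, 2.0], [1.0, 2.5]))
    print(gae([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0))
```

With `lam=0` the GAE sketch reduces to the one-step TD residual, and with `lam=1` it accumulates the full discounted sum of residuals along the trajectory. In the pipeline described above, the value estimates passed in would come from the in-sample value iteration step, and the resulting advantages would serve as the conditioning signal for the transformer.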

Paper Structure (30 sections, 1 theorem, 10 equations, 17 figures, 9 tables, 1 algorithm)

Key Result

Lemma A.1

(Performance Difference Lemma) Consider an infinite-horizon MDP $\mathcal{M}=\langle\mathcal{S}, \mathcal{A}, T, r, \rho_0, \gamma\rangle$. We have
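
For reference, the performance difference lemma is commonly stated in the following form. The notation is assumed here rather than taken from the paper: $J(\pi)$ is the expected discounted return of policy $\pi$ from the initial distribution $\rho_0$, $A^{\pi}$ is the advantage function of $\pi$, and $d^{\pi'}$ is the normalized discounted state visitation distribution of $\pi'$. The paper's own statement may differ in notation or normalization.

$$
J(\pi') - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi'}}\, \mathbb{E}_{a \sim \pi'(\cdot \mid s)} \left[ A^{\pi}(s, a) \right],
\qquad
d^{\pi'}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr\left(s_t = s \mid \pi', \rho_0\right).
$$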

Figures (17)

  • Figure 1: The performance of DT when the trajectory is divided into chunks. Chunk_size denotes the granularity of division, and full means the complete trajectory is preserved.
  • Figure 2: The encoder-decoder architecture of ACT. The encoder encodes the historical state-action sequence into a continuous representation. The attentive head of the decoder queries historical representation with the advantage and predicts an action.
  • Figure 3: Performance on the 2048 game. We report the mean and standard deviation of the performance across 5 independent runs and mark ESPER's score with the grey dotted line.
  • Figure 4: Performance curve on the stochastic Gym MuJoCo tasks as the training proceeds. We report the average and the standard deviation of the performance across 4 independent runs.
  • Figure 5: Ablation study on the effect of $\sigma_2$, using datasets from D4RL. The left column depicts the performance of each variant as the training proceeds, and the right column depicts the target advantages given by $c_\phi$. The results are taken from 4 independent runs.
  • ...and 12 more figures

Theorems & Definitions (1)

  • Lemma A.1