When should we prefer Decision Transformers for Offline Reinforcement Learning?

Prajjwal Bhargava; Rohan Chitnis; Alborz Geramifard; Shagun Sodhani; Amy Zhang

TL;DR

This work compares three popular offline RL algorithms, Conservative Q-Learning, Behavior Cloning, and Decision Transformer, and makes design and scaling recommendations. Among its findings, scaling the amount of data for DT by 5x gives a 2.5x average score improvement on Atari.

Abstract

Offline reinforcement learning (RL) allows agents to learn effective, return-maximizing policies from a static dataset. Three popular algorithms for offline RL are Conservative Q-Learning (CQL), Behavior Cloning (BC), and Decision Transformer (DT), from the classes of Q-Learning, Imitation Learning, and Sequence Modeling, respectively. A key open question is: which algorithm is preferred under what conditions? We study this question empirically by exploring the performance of these algorithms across the commonly used D4RL and Robomimic benchmarks. We design targeted experiments to understand their behavior with respect to data suboptimality, task complexity, and stochasticity. Our key findings are: (1) DT requires more data than CQL to learn competitive policies but is more robust; (2) DT is a substantially better choice than both CQL and BC in sparse-reward and low-quality data settings; (3) DT and BC are preferable as task horizon increases, or when data is obtained from human demonstrators; and (4) CQL excels in situations characterized by the combination of high stochasticity and low data quality. We also investigate architectural choices and scaling trends for DT on Atari and D4RL and make design/scaling recommendations. We find that scaling the amount of data for DT by 5x gives a 2.5x average score improvement on Atari.

Paper Structure (32 sections, 1 equation, 18 figures, 48 tables)

Introduction
Related Work
Preliminaries
Background
Experimental Setup
Experiments
Establishing Baseline Results
How does the amount and quality of data affect each agent's performance?
How are agents affected when trajectory lengths in the dataset increase?
How are agents affected when random data is added to the dataset?
How are agents affected by the complexity of the task?
How do agents behave in stochastic environments?
Scaling Properties of Decision Transformers on Atari
Limitations and Future Work
Architectural Properties of Decision Transformers
...and 17 more sections

Figures (18)

Figure 1: Normalized D4RL returns obtained by training DT, CQL, and BC on various amounts of highest-return ("best") or lowest-return ("worst") data. The left plot (a) is on medium replay data, while the right plot (b) is on medium expert data; both plots average over the halfcheetah, hopper and walker tasks. Notice that the Y-axis limits are set lower on the left plot. We observe that CQL was the most sample-efficient agent when only a small amount of high-quality data was available, but it could degrade with lower-quality data. Meanwhile, the performance of DT never worsened with more data. Also, BC performed best with 20-40% of high-quality data, highlighting the importance of expert data for BC. For full results, refer to Figure \ref{fig:sample_eff_d4rl_appendix} in Appendix \ref{app:additionalresults}.
Figure 2: The impact of adding random data to the offline dataset for DT, CQL, and BC. The Y-axis shows the relative deterioration in the evaluation score. Results are averaged over tasks and over two strategies for creating random data. The Robomimic PH dataset does not have a dense-reward split.
Figure 3: Impact of increasing state-space dimensionality on D4RL medium expert data (left) and of increasing task horizon on Robomimic with both PH data and PH plus an equal amount of random data (right).
Figure 4: Atari results for DT while scaling data only (left, blue), parameters only (right), and both simultaneously (left, orange), averaged over breakout, qbert, seaquest, and pong. Numbers on top of the orange curve show model parameter sizes. We found that scaling data was more impactful than scaling parameters, but scaling both together gave minor gains over scaling data only.
Figure 5: Results on dense-reward D4RL with added stochasticity during evaluation, averaged over the halfcheetah, hopper and walker tasks. % deterioration (Y-axis) represents the performance drop relative to running evaluations without any stochasticity. Numbers within boxes above each bar represent the average normalized scores obtained on that dataset with the stochasticity parameters.
...and 13 more figures
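
Two quantities recur in the captions above: subsets made of the highest-return ("best") or lowest-return ("worst") trajectories (Figure 1), and the percentage deterioration of a score relative to a baseline evaluation (Figures 2 and 5). The sketch below shows one plausible way to compute both. It is an illustration only: the function names `select_by_return` and `percent_deterioration` are hypothetical, and it assumes that subsets are chosen by sorting trajectories on total return and that deterioration is the drop divided by the baseline score. The paper's exact procedure is not given in this excerpt.

```python
from typing import List, Sequence


def select_by_return(returns: Sequence[float], fraction: float,
                     best: bool = True) -> List[int]:
    """Indices of the top (best=True) or bottom (best=False) fraction of
    trajectories, ranked by total return. Assumed selection rule."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one trajectory even for very small fractions.
    k = max(1, round(len(returns) * fraction))
    order = sorted(range(len(returns)), key=lambda i: returns[i], reverse=best)
    return order[:k]


def percent_deterioration(baseline: float, perturbed: float) -> float:
    """Performance drop relative to the baseline score, in percent.
    Assumed definition: 100 * (baseline - perturbed) / baseline."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (baseline - perturbed) / baseline


if __name__ == "__main__":
    # Made-up trajectory returns, for illustration only.
    toy_returns = [10.0, 50.0, 30.0, 20.0, 40.0]
    print(select_by_return(toy_returns, 0.4, best=True))
    print(select_by_return(toy_returns, 0.4, best=False))
    print(percent_deterioration(baseline=80.0, perturbed=60.0))
```

Under this definition a negative value would mean the perturbed score exceeds the baseline, so a sign check is useful when aggregating over tasks.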
