Reliable Conditioning of Behavioral Cloning for Offline Reinforcement Learning

Tung Nguyen, Qinqing Zheng, Aditya Grover

TL;DR

This work analyzes the reliability of return-conditioned behavioral cloning (BC) for offline reinforcement learning, showing that conditioning on out-of-distribution (ood) returns can degrade performance because of suboptimal training data and model architecture choices. It introduces ConserWeightive Behavioral Cloning (CWBC), a simple framework with two components: trajectory weighting, which biases training toward higher-return trajectories, and conservative regularization, which keeps predictions aligned with in-distribution behavior under ood conditioning. When instantiated with Decision Transformer (DT) and Reinforcement Learning via Supervised Learning (RvS), CWBC yields significant improvements on D4RL locomotion benchmarks (e.g., substantial gains over baselines and competitive performance with value-based methods) and enhances robustness in other domains such as Atari and Antmaze. Overall, CWBC provides a robust, tuning-free approach to improving the reliability of return-conditioned offline BC, bringing it closer to practical deployment while highlighting ongoing challenges in extrapolating beyond the offline data.

Abstract

Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively with its value-based counterparts, while enjoying greater simplicity and training stability. While promising, we show that these methods can be unreliable, as their performance may degrade significantly when conditioned on high, out-of-distribution (ood) returns. This is crucial in practice, as we often expect the policy to perform better than the offline dataset by conditioning on an ood value. We show that this unreliability arises from both the suboptimality of the training data and the choice of model architecture. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization. Trajectory weighting upweights the high-return trajectories to reduce the train-test gap for BC methods, while the conservative regularizer encourages the policy to stay close to the data distribution under ood conditioning. We study CWBC in the context of RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC significantly boosts their performance on various benchmarks.
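
To make the conservative regularizer more concrete, the following is a minimal sketch of one plausible reading of the idea: for samples drawn from high-return trajectories, perturb the return-to-go so that it becomes ood, then penalize the policy for deviating from the logged action. Everything specific here is an assumption for illustration and is not given in the abstract: the percentile threshold `q`, the noise scale `noise_std`, the way the return is shifted above `r_max`, the squared-error penalty, and the hypothetical `policy(states, rtgs)` callable.

```python
import numpy as np

def conservative_penalty(policy, states, actions, rtgs, traj_returns,
                         r_max, q=95, noise_std=1000.0, rng=None):
    """Illustrative conservative regularizer (assumed form, not the paper's code).

    policy:       hypothetical callable (states, rtgs) -> predicted actions
    states:       array of shape (N, state_dim)
    actions:      array of shape (N, action_dim), the logged actions
    rtgs:         array of shape (N,), return-to-go for each sample
    traj_returns: array of shape (N,), total return of each sample's trajectory
    r_max:        maximum trajectory return in the offline dataset
    """
    rng = np.random.default_rng(0) if rng is None else rng

    # Only regularize samples that come from high-return trajectories.
    high = traj_returns >= np.percentile(traj_returns, q)

    # Shift the conditioning value so the implied trajectory return reaches
    # r_max, then add non-negative noise so it lands beyond the data (ood).
    noise = np.abs(rng.normal(0.0, noise_std, size=int(high.sum())))
    ood_rtgs = rtgs[high] + (r_max - traj_returns[high]) + noise

    # Penalize deviation from the logged actions under ood conditioning.
    pred = policy(states[high], ood_rtgs)
    return float(np.mean(np.sum((pred - actions[high]) ** 2, axis=-1)))
```

Under these assumptions, the penalty would be added to the usual BC loss with some weight. A policy that already ignores implausibly high returns incurs no penalty, while one that extrapolates its actions with the conditioning value is pulled back toward the data.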

Paper Structure

This paper contains 30 sections, 14 equations, 12 figures, 8 tables, 2 algorithms.

Figures (12)

  • Figure 1: Illustrative figures demonstrating three hypothetical scenarios for conditioning of BC methods for offline RL. The green line shows the maximum return in the offline dataset, while the orange line shows the expert return. The ideal scenario (a) is hard or even impossible to achieve with suboptimal offline data. On the other hand, return-conditioned RL methods can show unreliable generalization (b), where the performance drops quickly after a certain point in the vicinity of the dataset maximum. Our goal is to ensure reliable generalization (c) even when conditioned on ood returns.
  • Figure 2: Reliability of RvS and DT on different walker2d datasets. The first row shows the performance of the two methods, and the second row shows the return distribution of each dataset. Reliability decreases as the data quality decreases from med-expert to med-replay. While DT performs reliably, RvS exhibits vast drops in performance.
  • Figure 3: Performance of DT when the state and RTG tokens are concatenated. The results are averaged over $10$ seeds.
  • Figure 4: The original return distribution $\mathcal{T}$ and the transformed distribution $\widetilde{\mathcal{T}}$ of walker2d-med-replay. We use $B=20$, $\lambda = 0.01$, $\kappa = \widehat{r}^\star - \widehat{r}_{90}$, where $\widehat{r}_{90}$ is the $90$-th percentile of the returns in the offline dataset.
  • Figure 5: The performance of RvS and its two variants on walker2d datasets. RvS+W denotes RvS with trajectory weighting only, while RvS+CWBC is RvS with both trajectory weighting and conservative regularization.
  • ...and 7 more figures
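
The transformed return distribution mentioned in the Figure 4 caption can be illustrated with a small sketch. It uses the caption's hyperparameters ($B=20$ bins, $\lambda=0.01$, $\kappa=\widehat{r}^\star-\widehat{r}_{90}$), but the weighting formula itself is an assumption, since the captions do not state it: each return bin is weighted by a smoothed frequency term $f_b/(f_b+\lambda)$ times an exponential term $\exp(-|r_b-\widehat{r}^\star|/\kappa)$ that favors bins near the best return. Taking $\widehat{r}^\star$ as the maximum dataset return and $r_b$ as the bin center are also assumptions, and the function name `trajectory_sampling_probs` is hypothetical.

```python
import numpy as np

def trajectory_sampling_probs(returns, num_bins=20, lam=0.01, kappa=None):
    """Per-trajectory sampling probabilities under an assumed weighting scheme."""
    returns = np.asarray(returns, dtype=float)
    r_star = returns.max()  # assumed: r_hat_star is the best return in the data
    if kappa is None:
        kappa = r_star - np.percentile(returns, 90)
    kappa = max(kappa, 1e-8)  # guard against a degenerate return distribution

    # Assign each trajectory to one of num_bins equal-width return bins.
    edges = np.linspace(returns.min(), r_star, num_bins + 1)
    bin_ids = np.digitize(returns, edges[1:-1])
    counts = np.bincount(bin_ids, minlength=num_bins)
    freq = counts / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Assumed bin weight: smoothed frequency times closeness to the best return.
    weights = freq / (freq + lam) * np.exp(-np.abs(centers - r_star) / kappa)
    bin_probs = weights / weights.sum()

    # Pick a bin, then pick uniformly among the trajectories inside it.
    return bin_probs[bin_ids] / counts[bin_ids]
```

With these probabilities, training batches could be drawn with `np.random.default_rng(0).choice(len(returns), size=batch_size, p=probs)`. The frequency term keeps sparsely populated bins from dominating, while the exponential term shifts probability mass toward high-return trajectories.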