Reliable Conditioning of Behavioral Cloning for Offline Reinforcement Learning
Tung Nguyen, Qinqing Zheng, Aditya Grover
TL;DR
This work analyzes the reliability challenges of return-conditioned behavioral cloning (BC) for offline reinforcement learning, showing that conditioning on out-of-distribution returns can cause instability that stems from suboptimal training data and model architecture choices. It introduces ConserWeightive Behavioral Cloning (CWBC), a simple framework with two components: trajectory weighting, which biases training toward higher-return trajectories, and conservative regularization, which keeps predictions aligned with in-distribution behavior under out-of-distribution conditioning. When instantiated with Decision Transformer and Reinforcement Learning via Supervised Learning (RvS), CWBC improves significantly over the unmodified baselines on D4RL locomotion benchmarks, performs competitively with value-based methods, and enhances robustness in other domains such as Atari and Antmaze. Overall, CWBC provides a robust, tuning-free approach to improving the reliability of return-conditioned offline BC, bringing it closer to practical deployment while highlighting ongoing challenges in extrapolating beyond the offline data.
Abstract
Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that, by conditioning on desired future returns, BC methods can perform competitively with their value-based counterparts while being much simpler and more stable to train. While promising, we show that these methods can be unreliable, as their performance may degrade significantly when conditioned on high, out-of-distribution (ood) returns. This is crucial in practice, as we often expect the policy to perform better than the offline dataset by conditioning on an ood value. We show that this unreliability arises from both the suboptimality of the training data and the model architectures. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization. Trajectory weighting upweights the high-return trajectories to reduce the train-test gap for BC methods, while the conservative regularizer encourages the policy to stay close to the data distribution under ood conditioning. We study CWBC in the context of RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC significantly boosts their performance on various benchmarks.
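The idea of upweighting high-return trajectories can be illustrated with a minimal sketch. This is not the weighting scheme used in the paper: it assumes a plain softmax over range-normalized returns, and the names `trajectory_weights`, `sample_trajectory_indices`, and the `temperature` parameter are hypothetical, chosen only for illustration.

```python
import math
import random


def trajectory_weights(returns, temperature=1.0):
    """Sampling probabilities that favor higher-return trajectories.

    Assumption (not from the paper): a softmax over returns normalized
    by their range. A large `temperature` approaches uniform sampling;
    a small one concentrates mass on the best trajectories.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    r_max = max(returns)
    # Fall back to 1.0 when all returns are equal to avoid dividing by zero.
    scale = (r_max - min(returns)) or 1.0
    # Subtracting r_max keeps every exponent <= 0, so exp() cannot overflow.
    logits = [(r - r_max) / (scale * temperature) for r in returns]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory_indices(returns, batch_size, temperature=1.0, seed=0):
    """Draw trajectory indices (with replacement) for one training batch."""
    rng = random.Random(seed)
    weights = trajectory_weights(returns, temperature)
    return rng.choices(range(len(returns)), weights=weights, k=batch_size)


if __name__ == "__main__":
    # Hypothetical returns of five offline trajectories.
    returns = [12.0, 40.0, 55.0, 90.0, 95.0]
    print(trajectory_weights(returns, temperature=0.5))
    print(sample_trajectory_indices(returns, batch_size=8, temperature=0.5))
```

Sampling training batches from such a distribution, rather than uniformly, shifts the training data toward the high returns that the policy is conditioned on at test time, which is the train-test gap the abstract refers to.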
