Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets
Shriram Karpoora Sundara Pandian, Ali Baheri
TL;DR
The paper tackles learning control policies from offline datasets contaminated by poisoning in safety-critical domains. It introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), which uses a small clean reference set to train a binary discriminator, derives trajectory weights from the estimated density ratio, and clips these weights to bias BC toward expert-like behavior without modeling the contamination mechanism. The authors provide uniform clean-risk and excess-risk guarantees that can be independent of the contamination rate under suitable clipping, and they propose a three-stage training pipeline: discriminator training, weight computation, and policy learning with fixed weights. Empirically, Weighted BC is robust across multiple contaminated D4RL benchmarks, outperforming traditional BC, BCQ, and BRAC with modest computational overhead. The approach offers a practical, theoretically grounded method for robust offline imitation in the presence of data poisoning.
Abstract
Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, which degrades policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. We also establish a comprehensive evaluation framework that incorporates various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior-regularized actor-critic (BRAC).
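
The weighting idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy one-dimensional task, the logistic-regression discriminator, per-sample rather than trajectory-level weights, the linear policy fit by weighted least squares, and all names and constants such as `clip_max`. It follows the three stages in order: train a discriminator to separate the clean reference set from the large mixed dataset, convert its output into a clipped density ratio, then fit the policy with those weights held fixed.

```python
import numpy as np


def train_discriminator(x_ref, x_mix, steps=2000, lr=0.1):
    """Stage 1 (assumed form): logistic regression, label 1 = clean reference, 0 = mixed data."""
    x = np.vstack([x_ref, x_mix])
    x = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    y = np.concatenate([np.ones(len(x_ref)), np.zeros(len(x_mix))])
    theta = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ theta)))
        theta -= lr * (x.T @ (p - y)) / len(y)  # gradient of the mean log-loss
    return theta


def clipped_density_ratio(theta, x_mix, n_ref, clip_max=5.0):
    """Stage 2: ratio estimate d/(1-d) * (n_mix/n_ref), clipped to [0, clip_max]."""
    x = np.hstack([x_mix, np.ones((len(x_mix), 1))])
    logit = x @ theta
    ratio = np.exp(logit) * (len(x_mix) / n_ref)  # d/(1-d) equals exp(logit)
    return np.clip(ratio, 0.0, clip_max)


def weighted_bc_linear(states, actions, weights, ridge=1e-6):
    """Stage 3: weighted least squares for a linear policy a = [s, 1] @ W, weights fixed."""
    s = np.hstack([states, np.ones((len(states), 1))])
    sw = s * weights[:, None]
    gram = s.T @ sw + ridge * np.eye(s.shape[1])
    return np.linalg.solve(gram, sw.T @ actions)


def demo(seed=0):
    """Toy task: the expert acts as a = 2s; poisoned samples add an offset of 5 to the action."""
    rng = np.random.default_rng(seed)
    n_ref, n_mix, contamination = 100, 1000, 0.4

    s_ref = rng.normal(size=(n_ref, 1))
    a_ref = 2.0 * s_ref + 0.1 * rng.normal(size=(n_ref, 1))

    s_mix = rng.normal(size=(n_mix, 1))
    a_mix = 2.0 * s_mix + 0.1 * rng.normal(size=(n_mix, 1))
    poisoned = rng.random(n_mix) < contamination
    a_mix[poisoned] += 5.0  # hypothetical action-poisoning attack

    x_ref = np.hstack([s_ref, a_ref])
    x_mix = np.hstack([s_mix, a_mix])

    theta = train_discriminator(x_ref, x_mix)
    weights = clipped_density_ratio(theta, x_mix, n_ref)

    w_plain = weighted_bc_linear(s_mix, a_mix, np.ones(n_mix))
    w_weighted = weighted_bc_linear(s_mix, a_mix, weights)
    # The last row holds the bias term; the clean expert has zero bias.
    return float(w_plain[-1, 0]), float(w_weighted[-1, 0])


if __name__ == "__main__":
    bias_plain, bias_weighted = demo()
    print(f"plain BC bias: {bias_plain:.3f}, weighted BC bias: {bias_weighted:.3f}")
```

In this toy setting, unweighted BC absorbs the poisoned offset into the policy's bias term, while the clipped weights shrink the influence of poisoned samples so the fitted bias moves back toward the clean expert's value of zero. The clipping threshold trades variance against bias; the value used here is arbitrary and would need tuning in practice.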
