Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

Shriram Karpoora Sundara Pandian, Ali Baheri

TL;DR

The paper addresses learning control policies from offline datasets that have been contaminated by poisoning, a concern in safety-critical domains. It introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), which trains a binary discriminator against a small clean reference set, derives trajectory weights from the resulting density ratio, and clips those weights so that BC is biased toward expert-like behavior without modeling the contamination mechanism. The authors provide uniform clean-risk and excess-risk guarantees that, under suitable clipping, can be independent of the contamination rate, and they propose a three-stage training pipeline: discriminator training, weight computation, and policy learning with fixed weights. Empirically, Weighted BC is robust across multiple contaminated D4RL benchmarks, outperforming traditional BC, BCQ, and BRAC with modest computational overhead. The approach offers a practical, theoretically grounded method for robust offline imitation under data poisoning.

Abstract

Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, which degrades policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. We also establish a comprehensive evaluation framework that incorporates several poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior regularized actor-critic (BRAC).
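
The weighting step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the discriminator outputs d(x) = P(clean | x) and was trained on balanced clean and mixed batches, so that d / (1 - d) estimates the density ratio. The function names, the `clip_max` default, the per-sample (rather than per-trajectory) weighting, and the self-normalized squared-error loss are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def density_ratio_weights(clean_probs: Sequence[float],
                          clip_max: float = 10.0,
                          eps: float = 1e-6) -> List[float]:
    """Map discriminator outputs d = P(clean | x) to clipped ratio weights."""
    weights = []
    for d in clean_probs:
        d = min(max(d, eps), 1.0 - eps)       # keep d away from 0 and 1
        ratio = d / (1.0 - d)                 # estimates p_clean / p_mixed (balanced classes assumed)
        weights.append(min(ratio, clip_max))  # clipping bounds any single sample's influence
    return weights


def weighted_bc_loss(pred_actions: Sequence[Sequence[float]],
                     data_actions: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> float:
    """Self-normalized weighted squared error between policy and dataset actions."""
    total = 0.0
    for pred, target, w in zip(pred_actions, data_actions, weights):
        total += w * sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / max(sum(weights), 1e-12)   # guard against all-zero weights


if __name__ == "__main__":
    # Hypothetical discriminator outputs for four samples.
    weights = density_ratio_weights([0.5, 0.9, 0.1, 0.999])
    print(weights)

    # A sample with weight 0 contributes nothing to the loss.
    predicted = [[0.0], [0.0]]
    recorded = [[1.0], [3.0]]
    print(weighted_bc_loss(predicted, recorded, [1.0, 0.0]))
```

The weights are computed once and then held fixed during policy learning, matching the three-stage pipeline in the TL;DR. A sample the discriminator scores as almost certainly corrupted gets a weight near zero and is effectively discarded, while the clip keeps a confidently clean sample from dominating the objective.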

Paper Structure

This paper contains 16 sections, 12 equations, 3 figures, 3 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Average return as a function of contamination level $\alpha$ across four D4RL environments (columns) and four poisoning types (rows). Shaded regions indicate high contamination ($\alpha \geq 0.8$). Weighted BC maintains superior performance compared to Traditional BC, BCQ, and BRAC, particularly under severe contamination. Error bars represent standard error over 5 random seeds.
  • Figure 2: Relative performance improvement of Weighted BC over the best baseline at each contamination level, shown as percentage gains. Green cells indicate superior performance, with darker shades representing larger improvements (up to 200%). Weighted BC shows consistent superiority with 93% of scenarios showing positive improvement, particularly under high contamination and action poisoning.
  • Figure 3: Performance retention normalized to clean baseline ($\alpha = 0$) as contamination increases across four poisoning types. Shaded regions represent variance across environments. Weighted BC maintains over 80% retention up to 60% contamination, while Traditional BC shows linear degradation ($R(\alpha) \approx 1 - 0.8\alpha$). BCQ and BRAC exhibit non-monotonic brittleness in their conservatism mechanisms.
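
To make the retention comparison in Figure 3 concrete, the caption's linear fit for Traditional BC can be evaluated directly. The helper name below is hypothetical, and the fit is only the approximate trend quoted in the caption, not raw experimental data.

```python
def bc_retention_fit(alpha: float) -> float:
    """Approximate Traditional BC retention from the Figure 3 caption: R(alpha) ~ 1 - 0.8 * alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return 1.0 - 0.8 * alpha


# Evaluate the fitted trend at a few contamination levels.
for alpha in (0.0, 0.3, 0.6, 1.0):
    print(f"alpha={alpha:.1f}  fitted BC retention={bc_retention_fit(alpha):.2f}")
```

At $\alpha = 0.6$ the fit gives a retention of roughly 0.52, whereas the caption reports that Weighted BC stays above 80% retention up to the same contamination level.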