Dataset Poisoning Attacks on Behavioral Cloning Policies

Akansha Kalra, Soumil Datta, Ethan Gilmore, Duc La, Guanhong Tao, Daniel S. Brown

TL;DR

The paper investigates the robustness of Behavioral Cloning (BC) to clean-label dataset-poisoning backdoors in imitation learning. It shows that a malicious actor can insert a visual trigger into a small fraction of demonstrations to create a backdoor that elicits a target action $a_{\rm target}$ at test time, and thereby influences the policy's expected return $J(\pi_{\rm bc}) = \mathbb{E}_{\pi_{\rm bc}}[\sum_t r_t]$. It introduces an entropy-based test-time trigger strategy that uses the policy entropy $\mathcal{H}(\pi(\cdot\mid s))$ to select critical states and degrade performance under a limited attack budget, and it reports empirical results in the Car Racing environment across trigger types, poisoning fractions, and patch sizes. The key finding is that BC policies can achieve near-baseline performance yet remain highly vulnerable to backdoors: poisoning as little as approximately 2.31% of the dataset enables near-perfect control when triggered. This underscores the need for defenses and robust evaluation in imitation learning. The work highlights practical security risks for real-world cyber-physical systems and motivates future research on backdoor detection and robust BC.

Abstract

Behavior Cloning (BC) is a popular framework for training sequential decision policies from expert demonstrations via supervised learning. As these policies are increasingly being deployed in the real world, their robustness and potential vulnerabilities are an important concern. In this work, we perform the first analysis of the efficacy of clean-label backdoor attacks on BC policies. Our backdoor attacks poison a dataset of demonstrations by injecting a visual trigger to create a spurious correlation that can be exploited at test time. We evaluate how policy vulnerability scales with the fraction of poisoned data, the strength of the trigger, and the trigger type. We also introduce a novel entropy-based test-time trigger attack that substantially degrades policy performance by identifying critical states where test-time triggering of the backdoor is expected to be most effective at degrading performance. We empirically demonstrate that BC policies trained on even minimally poisoned datasets exhibit deceptively high, near-baseline task performance despite being highly vulnerable to backdoor trigger attacks during deployment. Our results underscore the urgent need for more research into the robustness of BC policies, particularly as large-scale datasets are increasingly used to train policies for real-world cyber-physical systems. Videos and code are available at https://sites.google.com/view/dataset-poisoning-in-bc.
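
The entropy-based selection of critical states mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a discrete action distribution per state, uses the Shannon entropy $\mathcal{H}(\pi(\cdot\mid s)) = -\sum_a \pi(a\mid s)\log \pi(a\mid s)$, and picks states offline from a recorded rollout. The function names, the `budget` parameter, and the `low_entropy_is_critical` flag are hypothetical; this summary does not state whether low or high entropy marks a critical state, so the direction is left as a parameter.

```python
import numpy as np


def policy_entropy(probs):
    """Shannon entropy (in nats) of one discrete action distribution."""
    # Clip to avoid log(0); this is a numerical convenience, not part of the definition.
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))


def select_trigger_steps(action_probs, budget, low_entropy_is_critical=True):
    """Return the time steps at which the trigger would be shown.

    action_probs: sequence of per-step action distributions from a rollout.
    budget: maximum number of steps the attacker may trigger (assumed meaning).
    low_entropy_is_critical: ASSUMPTION; flip it if high entropy should be targeted.
    """
    entropies = np.array([policy_entropy(p) for p in action_probs])
    order = np.argsort(entropies, kind="stable")  # ascending entropy
    if not low_entropy_is_critical:
        order = order[::-1]
    return sorted(int(i) for i in order[:budget])


if __name__ == "__main__":
    rollout = [
        [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
        [0.97, 0.01, 0.01, 0.01],  # confident
        [0.50, 0.50, 0.00, 0.00],
        [1.00, 0.00, 0.00, 0.00],  # deterministic
    ]
    print(select_trigger_steps(rollout, budget=2))
```

An online attacker would not see the whole rollout in advance; a threshold on the current state's entropy, with a counter for the remaining budget, would be the natural variant of this sketch.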

Paper Structure

This paper contains 15 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Sensitivity to the percentage of poisoned training data. Mean episode reward (left) and backdoor control rate (right) for varying poisoning percentages, using a fixed $3{\times}3$ red patch on gas-labeled frames. Results are averaged across 5 seeds with 10 rollouts per seed; error bars denote standard deviation. BC policies achieve near-baseline performance despite being poisoned, while even the smallest amount of injected poison gives the attacker complete control.
  • Figure 2: Visual comparison of clean and poisoned frames using different attack types. All attacks shown are a $3{\times}3$ patch in the top-left corner of the visual observation.
  • Figure 3: Patch size sensitivity for red pixel and Gaussian noise patches. All models are trained with 5% of gas-labeled frames poisoned using an $N{\times}N$ red square patch. Reward performance is similar for both patch types across all $N$. The backdoor effect rises at a very small $N$ and stays fairly stable afterwards. The Gaussian patch achieves a lower control rate because a small poisoning rate is not enough for the policy to learn a more subtle trigger.
  • Figure 4: Patch-size sweep comparison at a higher poisoning rate (20% of gas-labeled frames) for red and Gaussian patches. The mean reward (left) is comparable between both patch types, while the control rate via the patch (right) is high across the board. The red patch outperforms the Gaussian patch at smaller patch sizes, but the Gaussian patch still performs well, with over 90% attack accuracy.
  • Figure 5: Visual comparison of poisoned frames using different attack types. All attacks shown are a $3{\times}3$ patch in the top-left corner of the visual observation.
  • ...and 1 more figures
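
The clean-label poisoning setup that the captions describe (a small patch in the top-left corner, applied to a fraction of gas-labeled frames) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: frames are assumed to be `uint8` RGB arrays of shape `(num_frames, H, W, 3)`, actions are assumed to be integer labels, and `stamp_red_patch`, `poison_dataset`, `target_label`, and `fraction` are hypothetical names. Labels are never modified, which is what makes the attack clean-label.

```python
import numpy as np


def stamp_red_patch(frame, n=3):
    """Return a copy of an HxWx3 uint8 frame with an n-by-n red patch at the top-left."""
    out = frame.copy()
    out[:n, :n] = (255, 0, 0)  # pure red in RGB; the exact color is an assumption
    return out


def poison_dataset(frames, labels, target_label, fraction, n=3, seed=0):
    """Stamp the trigger on a random fraction of frames whose label is target_label.

    Returns the poisoned frames and the sorted indices that were modified.
    The labels array is left untouched (clean-label poisoning).
    """
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(labels == target_label)
    k = int(round(fraction * len(candidates)))
    chosen = rng.choice(candidates, size=k, replace=False)
    poisoned = frames.copy()
    for i in chosen:
        poisoned[i] = stamp_red_patch(poisoned[i], n)
    return poisoned, np.sort(chosen)


if __name__ == "__main__":
    frames = np.zeros((100, 8, 8, 3), dtype=np.uint8)  # toy stand-in for observations
    labels = np.array([0, 1] * 50)                     # 1 plays the role of "gas" here
    poisoned, idx = poison_dataset(frames, labels, target_label=1, fraction=0.2)
    print(len(idx), "frames poisoned")
```

In this sketch `fraction` is measured relative to the gas-labeled frames, matching the wording of the Figure 3 and Figure 4 captions; a fraction quoted relative to the whole dataset would need to be rescaled accordingly.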