Cooperative Backdoor Attack in Decentralized Reinforcement Learning with Theoretical Guarantee

Mengtong Gao; Yifei Zou; Zuyuan Zhang; Xiuzhen Cheng; Dongxiao Yu


TL;DR

This work investigates backdoor attacks in policy-based decentralized reinforcement learning, where agents share policies during learning. It proposes Co-Trojan, a cooperative backdoor attack that decomposes the global backdoor into components distributed across multiple malicious agents; the full backdoor is assembled only when benign agents aggregate the shared policies. The authors provide a theoretical guarantee that the distributed decomposition can realize the target backdoor and validate the approach with Atari experiments, showing effectiveness comparable to a centralized attack while being stealthier. The study highlights a significant security risk in decentralized RL and motivates the development of defenses against distributed backdoor strategies.

Abstract

Securing decentralized reinforcement learning (RL) is challenging because malicious agents can share poisoned policies with benign agents. This paper investigates a cooperative backdoor attack in the decentralized RL setting. Unlike existing methods, which hide an entire backdoor behind a shared policy, our method decomposes the backdoor behavior into multiple components according to the state space of the RL problem. Each malicious agent hides one component in its policy and shares that policy with the benign agents. Once a benign agent has learned all the poisoned policies, the complete backdoor is assembled in its own policy. We give a theoretical proof that the cooperative method successfully injects the backdoor into the RL policies of benign agents. Compared with existing backdoor attacks, the cooperative method is more covert: each attacker's policy contains only one component of the backdoor, which makes it harder to detect. Extensive simulations in Atari environments demonstrate the efficiency and covertness of our method. To the best of our knowledge, this is the first paper to present a provable cooperative backdoor attack in decentralized reinforcement learning.


Paper Structure (14 sections, 2 theorems, 31 equations, 3 figures)


Introduction
Related work
Methodology
  System Model of Policy-based Decentralized Reinforcement Learning
  Problem Definition of Backdoor Attack in Reinforcement Learning
  Detailed Description of Co-Trojan
  Theoretical Analysis for the Correctness of Co-Trojan
Numerical Results
  Experimental Setup
  Numerical Results
Conclusion
Appendix
  Proof of Consistency between policy aggregation and value aggregation
  Proof of Correctness

Key Result

Theorem 3.1

For any predefined global backdoor attack policy $\pi^\dagger$, there exists a decomposition $\Phi_f(i)$ such that, through the policy-sharing process of decentralized RL, the resulting global backdoor policy $\pi_g^\dagger$ accurately approximates the target global backdoor policy $\pi^\dagger$.
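
To make the idea of a state-space decomposition concrete, the following is a minimal toy sketch and not the paper's construction. It assumes a tabular setting, represents policies by action-value tables, and assumes the benign agent aggregates the attackers' shared tables by plain averaging. The names `decompose_backdoor` and `aggregate`, and the scaling by the number of attackers, are hypothetical choices made only so the averaged result matches the target in this toy.

```python
import numpy as np

def decompose_backdoor(q_benign, q_target, n_attackers):
    """Split the backdoor offset (q_target - q_benign) across attackers by state.

    Assumption (not from the paper): the benign agent averages the shared
    tables, so each component is scaled by n_attackers to survive the average.
    """
    delta = q_target - q_benign
    # States on which the target backdoor differs from benign behavior.
    poisoned_states = np.flatnonzero(np.any(delta != 0, axis=1))
    # Each attacker is assigned a disjoint subset of the poisoned states.
    parts = np.array_split(poisoned_states, n_attackers)
    shared = []
    for part in parts:
        q_i = q_benign.copy()
        q_i[part] += n_attackers * delta[part]
        shared.append(q_i)
    return shared

def aggregate(tables):
    """Hypothetical aggregation rule: element-wise mean of the shared tables."""
    return np.mean(tables, axis=0)

# Toy problem: 6 states, 3 actions, backdoor on states 1 and 4.
rng = np.random.default_rng(0)
q_benign = rng.normal(size=(6, 3))
q_target = q_benign.copy()
q_target[[1, 4], 2] += 5.0  # triggered states should prefer action 2

shared = decompose_backdoor(q_benign, q_target, n_attackers=2)
q_agg = aggregate(shared)

# Number of states on which each shared table deviates from the benign one.
changed_per_attacker = [
    int(np.any(q_i != q_benign, axis=1).sum()) for q_i in shared
]
```

In this toy, each shared table deviates from the benign table on a single state, while the averaged table reproduces the full target. The scaling factor is purely an artifact of the averaging assumption; the paper's actual decomposition $\Phi_f(i)$ and aggregation process may differ.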

Figures (3)

Figure 1: We study cooperative backdoor policy attacks in decentralized RL. Unlike a single backdoor policy attack, which hides the entire backdoor behind one malicious policy, our method decomposes the backdoor behavior into multiple components, each of which is hidden by an individual attacker within its own malicious policy. When a benign agent learns all the poisoned policies, the backdoor attack is assembled in its policy. Compared with a single backdoor policy attack, our method achieves the same attack performance but is harder to detect.
Figure 2: Performance Results for Breakout with Various Poisoning Conditions: (a) Strong Targeted Poison, (b) Weak Targeted Poison, and (c) Untargeted Poison. Each subplot shows the average rewards for TrojDRL (triggered), TrojDRL (clean), Co-Trojan (triggered), and Co-Trojan (clean). The lines are smoothed by averaging every five data points.
Figure 3: Performance Results for Seaquest with Various Poisoning Conditions: (a) Strong Targeted Poison, (b) Weak Targeted Poison, and (c) Untargeted Poison. Each subplot shows the average rewards for TrojDRL (triggered), TrojDRL (clean), Co-Trojan (triggered), and Co-Trojan (clean). The lines are smoothed by averaging every five data points.
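
The captions of Figures 2 and 3 state that the curves are smoothed by averaging every five data points but do not say whether the groups overlap. The sketch below shows both common readings under that uncertainty; the function names `block_average` and `moving_average` are illustrative and are not taken from the paper.

```python
import numpy as np

def block_average(values, window=5):
    """Average consecutive non-overlapping groups of `window` points.

    A trailing group with fewer than `window` points is dropped.
    """
    values = np.asarray(values, dtype=float)
    n_full = len(values) // window * window
    return values[:n_full].reshape(-1, window).mean(axis=1)

def moving_average(values, window=5):
    """Average over a sliding window of `window` points (overlapping groups)."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

rewards = np.arange(10)
blocks = block_average(rewards)       # two points: means of 0..4 and 5..9
sliding = moving_average(rewards)     # six points: one per full window
```

Block averaging shortens a curve by a factor of five, whereas the moving average keeps nearly the original resolution, so the two readings produce visibly different plots from the same reward log.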

Theorems & Definitions (5)

Theorem 3.1
proof
proof
Theorem A.1
Proof of Theorem 3.1
