Deceptive Sequential Decision-Making via Regularized Policy Optimization

Yerin Kim, Alexander Benvenuti, Bo Chen, Mustafa Karabag, Abhishek Kulkarni, Nathaniel D. Bastian, Ufuk Topcu, Matthew Hale

TL;DR

This work addresses the risk that adversaries may infer a system's objectives by applying inverse reinforcement learning (IRL) to the observed actions of a multi-agent Markov decision process (MMDP). It introduces three regularized policy-synthesis schemes (diversionary, targeted, and equivocal deception) that actively mislead IRL about the reward structure, while bounding the resulting loss in total reward $R^* - R_{\pi}$ as a function of a deception parameter $\beta$. Each deception type is formulated as a tractable occupancy-measure optimization with a corresponding analytic bound on $L_{\pi}$, and the authors validate the approach through numerical experiments on a moving target defense (MTD) network-defense scenario using Apprenticeship Learning, MaxEnt IRL, and Deep IRL. The results show that deception can steer observer beliefs toward false conclusions while preserving near-optimal performance (at least 98% of $R^*$ in the reported experiments), enabling robust, deception-aware operation in critical autonomous systems.

Abstract

Autonomous systems are increasingly expected to operate in the presence of adversaries, though adversaries may infer sensitive information simply by observing a system. Therefore, we present a deceptive sequential decision-making framework that not only conceals sensitive information, but actively misleads adversaries about it. We model autonomous systems as Markov decision processes, with adversaries using inverse reinforcement learning to recover reward functions. To counter them, we present three regularization strategies for policy synthesis problems that actively deceive an adversary about a system's reward. "Diversionary deception" leads an adversary to draw any false conclusion about the system's reward function. "Targeted deception" leads an adversary to draw a specific false conclusion about the system's reward function. "Equivocal deception" leads an adversary to infer that the real reward and a false reward both explain the system's behavior. We show how each form of deception can be implemented in policy optimization problems and analytically bound the loss in total accumulated reward induced by deception. Next, we evaluate these developments in a multi-agent setting. We show that diversionary, targeted, and equivocal deception all steer the adversary to false beliefs while still attaining a total accumulated reward that is at least 98% of its optimal, non-deceptive value.

Paper Structure

This paper contains 25 sections, 5 theorems, 50 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

The following optimization problem computes the optimal value function $v^*(s)$ for all $s \in \mathcal{S}$:
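
For reference, a standard form of this linear program for a discounted MDP, as found in Puterman's text that the lemma cites, is sketched below. The discount factor $\gamma \in [0, 1)$, the reward $r(s, a)$, the transition kernel $\mathcal{T}(s' \mid s, a)$, the action set $\mathcal{A}$, and the weights $\alpha(s) > 0$ are assumed notation, and the paper's own statement may differ in symbols or in the reward criterion.

$$
\begin{aligned}
\min_{v} \quad & \sum_{s \in \mathcal{S}} \alpha(s)\, v(s) \\
\text{s.t.} \quad & v(s) \ge r(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{T}(s' \mid s, a)\, v(s') \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}.
\end{aligned}
$$

Under these assumptions, the minimizer is $v^*$ for any strictly positive weights $\alpha$. The dual of this program is posed over state-action occupancy measures, which is the kind of variable the deceptive policy-synthesis problems optimize over.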

Figures (5)

  • Figure 1: Transition probabilities of each agent in the MTD problem.
  • Figure 2: The adversary's estimate of $L^i$ of each agent under diversionary deception with three types of IRL. Each plot shows results averaged over $100$ runs. We used $\textit{number of trajectories} = 500$ and $\textit{length of trajectories} = 500$ for all simulations, $\textit{epoch} = 20$, $\textit{learning rate} = 0.5$ for MaxEnt IRL, and $\textit{epoch} = 50$, $\textit{learning rate} = 0.1$ and two fully connected hidden layers with dimensions $64$ and $32$ for Deep IRL. The blue line(s) represents the real agent(s), while the gray lines represent decoy agents. When $\beta = 0$, no deception is implemented, and the adversary correctly infers that the real agent(s) is the most important. However, with $\beta \geq 0.4$, the adversary's inference of the most important agent is, on average, one of the decoy agents, confirming that the policy satisfies the definition of diversionary deception.
  • Figure 3: The adversary's estimate of $L^i$ of each agent under targeted deception with three types of IRL. Each plot shows results averaged over $100$ runs. We used $\textit{number of trajectories} = 500$ and $\textit{length of trajectories} = 500$ for all simulations, $\textit{epoch} = 20$, $\textit{learning rate} = 0.5$ for MaxEnt IRL, and $\textit{epoch} = 50$, $\textit{learning rate} = 0.1$ and two fully connected hidden layers with dimension $64$ and $32$ for Deep IRL. The blue line represents the real agent, the red line(s) represents the target decoy agent(s), and the gray lines represent decoy agents. When $\beta = 0$, no deception is implemented, and the adversary correctly infers that the real agent is the most important. With $\beta\geq 0.02$ the likelihood of the target decoy agent(s) is greater than that of the real agent, implying that deception misleads the adversary into believing that the target decoy agent(s) is the real one. Thus, the policy satisfies the definition of targeted deception.
  • Figure 4: The adversary's estimate of $L^i$ of each agent under equivocal deception with three types of IRL. Each plot shows results averaged over $100$ runs. We used $\textit{number of trajectories} = 500$ and $\textit{length of trajectories} = 500$ for all simulations, $\textit{epoch} = 20$, $\textit{learning rate} = 0.5$ for MaxEnt IRL, and $\textit{epoch} = 50$, $\textit{learning rate} = 0.1$ and two fully connected hidden layers with dimension $64$ and $32$ for Deep IRL. The blue line represents the real agent, the red line(s) represents the target decoy agent(s), and the gray lines represent decoy agents. When $\beta = 0$, no deception is implemented, and the adversary correctly infers that the real agent is the most important. With $\beta\geq 50$ the likelihoods of the real agent and the target decoy agent(s) remain close to each other, which satisfies the definition of equivocal deception.
  • Figure 5: The revenue obtained from deceptive policies under diversionary, targeted, and equivocal deception. In diversionary deception, the system recovers $98.04\%$ and $99.07\%$ of the optimal revenue with $\beta = 0.4$. For targeted deception with $\beta = 0.02$, the system recovers $99.95\%$ and $99.90\%$ of the optimal revenue. For equivocal deception with $\beta = 50$, the system recovers $99.99\%$ and $99.98\%$ of the optimal revenue. In all types of deception, less than $2\%$ of revenue is lost, which outperforms worst-case bounds on performance loss, verifying the loss-bound theorems for diversionary, targeted, and equivocal deception.
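
The criteria that the captions above read off the plots can be written as simple checks on the adversary's per-agent estimates. The following is a minimal sketch, not the authors' code: the function names, the agent labels, the tolerance `tol` for "close", and all numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set


def is_diversionary(est: Dict[str, float], real: Set[str]) -> bool:
    """True if the adversary's top-ranked agent is a decoy (any false conclusion)."""
    top = max(est, key=est.get)
    return top not in real


def is_targeted(est: Dict[str, float], real: Set[str], targets: Set[str]) -> bool:
    """True if every target decoy is ranked above every real agent."""
    return min(est[t] for t in targets) > max(est[r] for r in real)


def is_equivocal(est: Dict[str, float], real: Set[str], targets: Set[str],
                 tol: float = 0.05) -> bool:
    """True if real and target estimates stay within tol of each other.

    tol is an assumed threshold; the captions only say the values remain close.
    """
    return all(abs(est[r] - est[t]) <= tol for r in real for t in targets)


# Hypothetical estimates, not values from the paper.
real_agents = {"agent_1"}
target_decoys = {"agent_2"}

no_deception = {"agent_1": 0.60, "agent_2": 0.25, "agent_3": 0.15}
with_deception = {"agent_1": 0.30, "agent_2": 0.45, "agent_3": 0.25}

print(is_diversionary(no_deception, real_agents))               # False
print(is_diversionary(with_deception, real_agents))             # True
print(is_targeted(with_deception, real_agents, target_decoys))  # True
print(is_equivocal(with_deception, real_agents, target_decoys)) # False
```

In this sketch a single set of estimates can satisfy the diversionary and targeted checks at once, since a specific false conclusion is also a false conclusion, while the equivocal check is a separate condition on how close the real and target estimates are.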

Theorems & Definitions (19)

  • Definition 1: Markov Decision Process
  • Definition 2: Multi-Agent Markov Decision Process (Boutilier, 1996)
  • Remark 1
  • Lemma 1: MDP LP (Puterman, 2014)
  • Definition 3
  • Definition 4
  • Remark 2
  • Definition 5: Diversionary Deception
  • Definition 6: Targeted Deception
  • Definition 7: Equivocal Deception
  • ...and 9 more