$\aleph$-IPOMDP: Mitigating Deception in a Cognitive Hierarchy with Off-Policy Counterfactual Anomaly Detection

Nitay Alon, Joseph M. Barnby, Stefan Sarkadi, Lion Schulz, Jeffrey S. Rosenschein, Peter Dayan

TL;DR

A computational framework called $\aleph$-IPOMDP is proposed, which augments the Bayesian inference of model-based RL agents with an anomaly detection algorithm and an out-of-belief policy that allows agents to realize that they are being deceived, even if they cannot understand how, and to deter opponents via a credible threat.

Abstract

Social agents with finitely nested opponent models are vulnerable to manipulation by agents with deeper recursive capabilities. This imbalance, rooted in logic and the theory of recursive modelling frameworks, cannot be solved directly. We propose a computational framework called $\aleph$-IPOMDP, which augments the Bayesian inference of model-based RL agents with an anomaly detection algorithm and an out-of-belief policy. Our mechanism allows agents to realize that they are being deceived, even if they cannot understand how, and to deter opponents via a credible threat. We test this framework in both a mixed-motive and a zero-sum game. Our results demonstrate the $\aleph$-mechanism's effectiveness, leading to more equitable outcomes and less exploitation by more sophisticated agents. We discuss implications for AI safety, cybersecurity, cognitive science, and psychiatry.

Paper Structure (30 sections, 42 equations, 12 figures, 3 algorithms)

This paper contains 30 sections, 42 equations, 12 figures, 3 algorithms.

Figures (12)

  • Figure 1: Paper overview: (Cognitive Hierarchy:) We model agents with finite recursive opponent modelling with different Depth of Mentalising (DoM). In the classic model, the player at DoM(0) is at the mercy of the DoM(1) partner given that the DoM(0) player cannot form nested beliefs about their opponent. ($\aleph$-IPOMDP:) The DoM(0) can overcome its recursive limitations by augmenting the classic model. We augment agents' inference processes with an anomaly detection mechanism that allows a self to detect deceptive others by matching expectations with observations. (IUG:) Agents with different degrees of DoM interact in the iterated ultimatum game (IUG). In the IUG, on each trial, a Sender offers a split of endowed money, and the Receiver decides whether to accept. If the Receiver decides to reject the offer, both players get 0. (Row/column game:) Agents interact in an iterated Bayesian zero-sum game. In this game, nature selects a payoff matrix ($G^1$ or $G^2$). Only the row player knows which payoff matrix is sampled and uses this information to their advantage. The column player makes inferences about the payoff matrix from the row player's behaviour.
  • Figure 2: DoM$(0)$ vs DoM$(-1)$ in IUG IPOMDP: (A) The points show offers from the sender to the receiver over all 12 trials, coloured by sender behavioural type (random or utility). Points are shown in white if the receiver rejects the offer. A DoM$(0)$ receiver quickly infers the type of the DoM$(-1)$ sender from the initial offers. The DoM$(-1)$ utility sender's first offer tells it apart from the random sender, as its initial offer is always close to $0$ (up to random noise). The DoM$(0)$ policy is a function of its updated beliefs. (B) Updated belief probabilities of the receiver when playing with different random or utility senders. DoM$(0)$ receivers are well tuned to detect which type of sender they are partnered with. When engaging with a threshold DoM$(-1)$ sender, the receiver rejects the offers until the sender is unwilling to "improve" its offers, which also corresponds to the certainty of its beliefs.
  • Figure 3: Illustration of deception in IUG: (A) Points show offers from the sender to the receiver over all 12 trials, coloured by sender behavioural type (random or utility). Points are shaded white if the receiver rejects the offer. The DoM$(1)$ sender acts deceptively, masquerading as a random sender and thereby hacking the DoM$(0)$ Bayesian IRL. It starts with a relatively high first offer and then decreases its offers sharply. (B) Updated belief probabilities of the receiver when playing with different random or utility senders. DoM$(0)$ receivers are poorly tuned to detect the type of DoM$(1)$ sender with which they are partnered, mistaking all actions as if they came from the random sender. This stratagem exploits a pitfall of the Bayesian inference used by the DoM$(0)$: every offer sequence is equally likely under the random sender. (C) Comparison of the expected reward from the DoM$(0)$ receiver's perspective, $E(\hat{r}_\nu^t)$ (striped bars), with the observed reward (non-striped bars) on each trial. This measures the advantage of the deceiver's policy, reflecting the deceiver's ability to increase its reward at the expense of the victim.
  • Figure 4: Illustration of the $\aleph$-mechanism algorithm: (Top Left) In IPOMDP, DoM$(k)$ agents use their nested model to compute best-response against DoM$(k-1)$ agents, taking advantage of their superior mentalising abilities. (Top Right) $\aleph$-IPOMDP augmentations allow agents to detect when their partners might be hierarchically superior, utilizing anomaly detection methods to avoid exploitation. (Bottom) The $\aleph$-mechanism combines expectation-observation monitoring with typicality (policy predictions) to verify that the observed agent acts "as expected". This can vary by agent, with individual differences dictating how much "evidence" of anomalies is required before the mechanism is triggered, and also how narrow or broad the typical set is.
  • Figure 5: Mitigation of deception in IUG with $\aleph$-IPOMDP: Points represent offers from the sender to the receiver across trials, biased by the sender's threshold. Points are shaded white if the receiver rejects the offer; triangular points indicate that the rejection is caused by the $\aleph$-mechanism, effectively terminating the interaction. Lines and points are visible while the $\aleph$-mechanism is off. Top row $\delta=0.1$, $\omega=0.3$ --- (A) Notably, both DoM$(1)$ senders masquerade as being random. However, their ability to execute the "random" behaviour ruse is limited by both $\aleph$-mechanism components. First, the cumulative reward has to satisfy the off-policy counterfactual reward component. Next, the variability of the offers is higher than in the IPOMDP case, respecting the typicality component. Ultimately, the deceiver's policy triggers the average reward monitoring ($Z^2$) component, as the observed reward is lower than expected (marked by the truncation of the line and points). (B) Cumulative reward ratio for the sender vs. the receiver. The $\aleph$-IPOMDP reduces the cumulative reward ratio (sender/receiver) by more than $40\%$. Bottom row $\delta=0.3$, $\omega=0.3$ --- (C) When it is constrained by narrower strong typicality set $(Z^1)$ bounds, the DoM$(1)$ with low threshold terminates the interaction faster than before, triggering this component after 6 trials, while the high threshold sender acts similarly. (D) Even when the interaction is shorter, the reward ratio is still reduced compared to the case of the conventional IPOMDP.
  • ...and 7 more figures
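
The deception pattern described in the captions for Figures 3 and 5 can be illustrated with a toy sketch. This is not the paper's algorithm or its $Z^1$/$Z^2$ tests. It is a minimal illustration under several assumptions of ours: a hypothetical discretised offer grid, a uniform likelihood for the random sender, a softmax likelihood (with a made-up temperature) for the utility sender, and a simple mean-reward shortfall check with a hypothetical `tolerance` parameter standing in for expectation-observation monitoring.

```python
import numpy as np

# Hypothetical offer grid: the fraction of the endowment offered to the receiver.
OFFERS = np.linspace(0.0, 1.0, 11)


def random_sender_likelihood(offer_idx):
    """Random sender: every offer is equally likely (assumption of this sketch)."""
    return 1.0 / len(OFFERS)


def utility_sender_likelihood(offer_idx, temperature=0.1):
    """Utility sender: softmax over its own share (1 - offer); illustrative only."""
    logits = (1.0 - OFFERS) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[offer_idx]


def update_belief(prior_random, offer_idx):
    """One Bayesian update of the receiver's belief that the sender is random."""
    weight_random = prior_random * random_sender_likelihood(offer_idx)
    weight_utility = (1.0 - prior_random) * utility_sender_likelihood(offer_idx)
    return weight_random / (weight_random + weight_utility)


def reward_shortfall_flag(observed_offers, tolerance):
    """Flag an anomaly when the mean observed offer falls short of the mean
    offer expected from a truly random sender by more than `tolerance`."""
    expected_mean = OFFERS.mean()
    return bool(expected_mean - np.mean(observed_offers) > tolerance)


# A deceptive-looking sequence: one generous offer followed by low offers.
offer_indices = [8, 2, 2, 2]
belief_random = 0.5
for idx in offer_indices:
    belief_random = update_belief(belief_random, idx)

observed = OFFERS[offer_indices]
flagged = reward_shortfall_flag(observed, tolerance=0.1)
```

In this toy setting, the belief in the random sender stays high after the generous first offer, because the uniform likelihood never rules out any later offer, while the shortfall check still flags that the receiver is earning less than a truly random sender would provide. Detection here does not require the receiver to model how the deception works, which is the intuition the captions convey.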