A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety

Hyunin Lee, Chanwoo Park, David Abel, Ming Jin

TL;DR

This work reframes black swan events as phenomena that can arise even in stationary environments due to misperception of reward and likelihood, introducing s-black swans. It builds a formal agent–environment framework with ground MDPs, Human MDPs, and Human-Estimation MDPs to study how perception gaps $\epsilon_r$, $\epsilon_d$ and estimation gaps $\kappa_r$, $\kappa_d$ propagate into suboptimal policies. The authors provide discrete and continuous definitions of s-black swans, prove a nonzero lower bound on the value-function gap between perceived and ground world models, and bound the hitting time for perception corrections, offering a theoretical foundation for safer AI systems that actively correct misperceptions. By analyzing simple sequential settings and extending to general MDPs, the paper lays groundwork for designing algorithms that mitigate tail risks by aligning human and machine reward and probability perceptions, with implications for robust AI safety in real-world decision making.

Abstract

Black swan events are statistically rare occurrences that carry extremely high risks. The typical view assumes that black swan events originate from unpredictable, time-varying environments; however, the community lacks a comprehensive definition of such events. This paper argues that the standard view is incomplete and claims that high-risk, statistically rare events can also occur in unchanging environments due to human misperception of their value and likelihood; we call these spatial black swan events. We first carefully categorize black swan events, focusing on spatial black swan events, and mathematically formalize their definition. We hope these definitions can pave the way for the development of algorithms that prevent such events by rationally correcting human perception.
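
To make the central claim concrete, the following is a minimal sketch of a stationary, one-step decision problem in which a misperceived likelihood alone leads to a suboptimal choice. It is not the paper's construction: the action names, the rewards, the probabilities, and the thresholding rule in `perceive` are all hypothetical assumptions chosen only for illustration.

```python
# Hypothetical one-step decision problem; every number here is an
# illustrative assumption, not a value taken from the paper.
# Each action maps to a list of (probability, reward) outcomes.
TRUE_MODEL = {
    "safe": [(1.0, 1.0)],
    "risky": [(0.995, 2.0), (0.005, -1000.0)],  # rare, catastrophic outcome
}


def perceive(model, threshold=0.01):
    """Assumed misperception: outcomes rarer than `threshold` are treated
    as impossible, and the remaining probabilities are renormalized."""
    perceived = {}
    for action, outcomes in model.items():
        kept = [(p, r) for p, r in outcomes if p >= threshold]
        total = sum(p for p, _ in kept)
        perceived[action] = [(p / total, r) for p, r in kept]
    return perceived


def expected_value(outcomes):
    """Expected reward of a list of (probability, reward) outcomes."""
    return sum(p * r for p, r in outcomes)


def greedy_action(model):
    """Action with the highest expected reward under the given model."""
    return max(model, key=lambda a: expected_value(model[a]))


perceived_model = perceive(TRUE_MODEL)
a_true = greedy_action(TRUE_MODEL)            # optimal under the ground model
a_perceived = greedy_action(perceived_model)  # optimal under the perceived model

# Value lost, measured in the ground model, by acting on the perceived model.
gap = expected_value(TRUE_MODEL[a_true]) - expected_value(TRUE_MODEL[a_perceived])
print(a_true, a_perceived, gap)
```

The environment never changes in this sketch; the positive gap comes entirely from the agent ignoring a rare, high-cost outcome, which is the kind of situation the abstract attributes to misperception of value and likelihood.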

Paper Structure

This paper contains 32 sections, 11 theorems, 68 equations, 2 figures, 1 algorithm.

Key Result

Proposition 1

If $(s, a, t_{bs})$ is a black swan event, then there exists a time interval $[t_1, t_2] \subseteq [T]$ such that for every $t \in [t_1, t_2]$, $(s, a, t)$ is classified as an s-black swan.

Figures (2)

  • Figure 1: Value distortion function $u$ and probability distortion function $w$. The gray line in each panel (value distortion function and probability distortion function) represents $y = x$.
  • Figure 2: The agent and environment intersect with perception.

Theorems & Definitions (28)

  • Example 1: Insurance policies
  • Definition 1: Value Distortion Function
  • Definition 2: Probability Distortion Function
  • Definition 3: Black Swan Event Dimension
  • Proposition 1
  • Example 2
  • Remark 1
  • Theorem 1: One-Step Optimality Deviation
  • Theorem 2: Multi-step Optimality Deviation with $|{\mathcal{S}}|=2$
  • Theorem 3: Two-step Optimality Deviation with $|{\mathcal{S}}|=3$
  • ...and 18 more