A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety

Hyunin Lee; Chanwoo Park; David Abel; Ming Jin

TL;DR

This work reframes black swan events as phenomena that can arise even in stationary environments due to misperception of reward and likelihood, introducing s-black swans. It builds a formal agent–environment framework with ground MDPs, Human MDPs, and Human-Estimation MDPs to study how perception gaps $\epsilon_r$, $\epsilon_d$ and estimation gaps $\kappa_r$, $\kappa_d$ propagate into suboptimal policies. The authors provide discrete and continuous definitions of s-black swans, prove a nonzero lower bound on the value-function gap between perceived and ground world models, and bound the hitting time for perception corrections, offering a theoretical foundation for safer AI systems that actively correct misperceptions. By analyzing simple sequential settings and extending to general MDPs, the paper lays groundwork for designing algorithms that mitigate tail risks by aligning human and machine reward and probability perceptions, with implications for robust AI safety in real-world decision making.

Abstract

Black swan events are statistically rare occurrences that carry extremely high risks. The typical view assumes that black swan events originate from unpredictable, time-varying environments; however, the community lacks a comprehensive definition of black swan events. This paper argues that the standard view is incomplete and claims that high-risk, statistically rare events can also occur in unchanging environments due to human misperception of their value and likelihood; we call these spatial black swan events. We first carefully categorize black swan events, focusing on spatial black swan events, and mathematically formalize their definition. We hope these definitions can pave the way for the development of algorithms that prevent such events by rationally correcting human perception.
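The abstract's central claim, that misperceived value and likelihood can produce a costly choice even when the environment never changes, can be illustrated with a toy one-step decision. This is a minimal sketch and not the paper's formal construction: the two actions, their rewards, and both probability tables are hypothetical numbers chosen only for illustration.

```python
# Toy one-step decision in a fixed (stationary) environment.
# Every action maps to a list of (probability, reward) outcomes.
# All numbers below are hypothetical and chosen only for illustration.

TRUE_MODEL = {
    "safe": [(1.0, 1.0)],
    "risky": [(0.99, 2.0), (0.01, -500.0)],  # rare, high-loss outcome
}

# Assumed misperception: the rare outcome is believed to be 100x less likely.
PERCEIVED_MODEL = {
    "safe": [(1.0, 1.0)],
    "risky": [(0.9999, 2.0), (0.0001, -500.0)],
}


def expected_reward(outcomes):
    """Expected reward of a list of (probability, reward) pairs."""
    return sum(p * r for p, r in outcomes)


def greedy_action(model):
    """Action with the highest expected reward under the given model."""
    return max(model, key=lambda action: expected_reward(model[action]))


true_choice = greedy_action(TRUE_MODEL)
perceived_choice = greedy_action(PERCEIVED_MODEL)

# Loss in true expected reward caused by acting on the perceived model.
value_gap = (expected_reward(TRUE_MODEL[true_choice])
             - expected_reward(TRUE_MODEL[perceived_choice]))

print(true_choice, perceived_choice, value_gap)
```

Nothing in the environment varies over time here; the gap in true expected reward comes entirely from the difference between the perceived and true probability of the rare outcome.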

Paper Structure (32 sections, 11 theorems, 68 equations, 2 figures, 1 algorithm)

Introduction
Preliminary
Notations.
Markov Decision Process.
Expected Utility Theory.
Prospect Theory.
Cumulative Prospect Theory.
Black Swan in stationary and non-stationary environments
The Emergence of s-black swan in Sequential Decision Making
Case 1. Contextual Bandit ($T=1$)
Case 2. $|\mathcal{S}|=2$ when $T >1$
Case 3. $|\mathcal{S}|=3$ with unbiased reward perception
Agent-Environment framework: perception as intersection
Human MDP
Human-Estimation MDP
...and 17 more sections

Key Result

Proposition 1

If $(s, a, t_{bs})$ is a black swan event, then there exists a time interval $[t_1, t_2] \subseteq [T]$ such that for every $t \in [t_1, t_2]$, $(s, a, t)$ is classified as an s-black swan.

Figures (2)

Figure 1: Value distortion function $u$ and probability distortion function $w$. The gray line in both panels represents $y = x$.
Figure 2: The agent and environment intersect with perception.
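The distortion functions of Figure 1 can be sketched numerically. The parametric forms and default parameters below are the Tversky and Kahneman (1992) cumulative prospect theory estimates, used here as an assumption; this page does not state which forms the paper adopts, and the function names are illustrative.

```python
def value_distortion(x, alpha=0.88, beta=0.88, lam=2.25):
    """Assumed value distortion u(x): concave for gains, convex and
    steeper for losses (loss aversion factor lam)."""
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)


def probability_distortion(p, gamma=0.61):
    """Assumed inverse-S probability distortion w(p) on [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    numerator = p ** gamma
    denominator = (numerator + (1.0 - p) ** gamma) ** (1.0 / gamma)
    return numerator / denominator


# Compare each function against the identity line y = x from Figure 1.
for p in (0.01, 0.5, 0.9):
    print(f"p={p:.2f}  w(p)={probability_distortion(p):.4f}")
for x in (-100.0, 100.0):
    print(f"x={x:+.0f}  u(x)={value_distortion(x):+.2f}")
```

Under these assumed parameters, small probabilities are weighted above $y = x$ and large ones below it, and a loss is weighted more heavily than a gain of the same size.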

Theorems & Definitions (28)

Example 1: Insurance policies
Definition 1: Value Distortion Function
Definition 2: Probability Distortion Function
Definition 3: Black Swan Event Dimension
Proposition 1
Example 2
Remark 1
Theorem 1: One-Step Optimality Deviation
Theorem 2: Multi-step Optimality Deviation with $|{\mathcal{S}}|=2$
Theorem 3: Two-step Optimality Deviation with $|{\mathcal{S}}|=3$
...and 18 more
