A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety
Hyunin Lee, Chanwoo Park, David Abel, Ming Jin
TL;DR
This work reframes black swan events as phenomena that can arise even in stationary environments due to misperception of reward and likelihood, introducing spatial black swans (s-black swans). It builds a formal agent–environment framework with ground MDPs, Human MDPs, and Human-Estimation MDPs to study how perception gaps $\epsilon_r$, $\epsilon_d$ and estimation gaps $\kappa_r$, $\kappa_d$ propagate into suboptimal policies. The authors provide discrete and continuous definitions of s-black swans, prove a nonzero lower bound on the value-function gap between perceived and ground world models, and bound the hitting time for perception corrections. Together these results offer a theoretical foundation for safer AI systems that actively correct misperceptions. By analyzing simple sequential settings and extending to general MDPs, the paper lays groundwork for designing algorithms that mitigate tail risks by aligning human and machine perceptions of reward and probability, with implications for robust AI safety in real-world decision making.
Abstract
Black swan events are statistically rare occurrences that carry extremely high risks. The typical view assumes that black swan events originate from unpredictable, time-varying environments; however, the community lacks a comprehensive definition of black swan events. This paper argues that the standard view is incomplete and claims that high-risk, statistically rare events can also occur in unchanging environments due to human misperception of their value and likelihood; we call these spatial black swan events. We first carefully categorize black swan events, focusing on spatial black swan events, and mathematically formalize the definition of black swan events. We hope these definitions can pave the way for the development of algorithms that prevent such events by rationally correcting human perception.
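To make the central claim concrete, the following is a minimal sketch of how misperceiving value and likelihood can produce a costly choice in a fixed, unchanging environment. It is a toy one-step decision problem, not the paper's MDP formalism or its definition of an s-black swan: the outcome numbers, the `perceive` distortion, and its parameters (`rare_threshold`, `prob_shrink`, `loss_scale`) are all assumptions chosen for illustration.

```python
# Toy one-step decision problem in a stationary environment.
# All numbers are hypothetical and not taken from the paper.
# Each action maps to a list of (probability, reward) outcomes.
GROUND = {
    "safe": [(1.0, 0.5)],
    "risky": [(0.99, 1.0), (0.01, -200.0)],  # rare, high-cost outcome
}


def perceive(outcomes, rare_threshold=0.05, prob_shrink=0.1, loss_scale=0.25):
    """Assumed distortion: rare outcomes look rarer, losses look smaller."""
    distorted = []
    for p, r in outcomes:
        if p < rare_threshold:
            p = p * prob_shrink  # underweight unlikely outcomes
        if r < 0:
            r = r * loss_scale  # underestimate the size of losses
        distorted.append((p, r))
    total = sum(p for p, _ in distorted)
    # Renormalize so the perceived probabilities still sum to one.
    return [(p / total, r) for p, r in distorted]


def expected_value(outcomes):
    return sum(p * r for p, r in outcomes)


def greedy_action(model):
    """Pick the action with the highest expected value under `model`."""
    return max(model, key=lambda a: expected_value(model[a]))


perceived = {action: perceive(outs) for action, outs in GROUND.items()}

human_action = greedy_action(perceived)   # chosen under distorted beliefs
optimal_action = greedy_action(GROUND)    # chosen under the true model

# Value lost in the true environment by acting on the perceived model.
value_gap = (
    expected_value(GROUND[optimal_action])
    - expected_value(GROUND[human_action])
)

print(human_action, optimal_action, round(value_gap, 2))
```

With these made-up numbers, the action that looks best under the distorted model is the risky one, while the safe action is optimal under the true model, so the agent gives up about 1.51 in expected value even though nothing in the environment changes over time.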
