
Human-AI Safety: A Descendant of Generative AI and Control Systems Safety

Andrea Bajcsy, Jaime F. Fisac

TL;DR

The paper addresses the challenge of safety in human–AI interactions by merging control-theoretic safety guarantees with the representational power of generative AI to account for dynamic feedback loops between users and AI systems. It introduces a formal Human–AI Systems Theory, models the interaction as a dynamical game with a safety value function $V(z^{AI}_0)$ and a safety set ${\Omega}^{*}$, and proposes a Frontier Framework—the Human–AI Safety Filter—that monitors and minimally overrides AI outputs to prevent safety violations. Key contributions include defining safety as the continual satisfaction of human needs, formalizing a zero-sum safety game with Isaacs dynamics, and laying out a scalable safety-filter architecture that can operate in latent spaces using self-play and probabilistic guarantees. The work advances practical, at-scale safety assurances for advanced AI by integrating control theory with AI modeling, with implications for safer deployment, governance, and policy formation in dynamic human–AI ecosystems.
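As a rough illustration of what a safety value function and safety set of this kind can look like, here is the standard discrete-time form used in reachability-based safety analysis. It is a sketch, not necessarily the paper's exact definition. The margin function $\ell$, the failure set $\mathcal{F}$, the sign convention, and the simplified dynamics $f$ (which drop the observation argument of the AI world model) are assumptions introduced here.

$$
\ell(z) < 0 \iff z \in \mathcal{F},
\qquad
V(z^{AI}_0) = \max_{\pi^{AI}} \, \min_{\pi^{H}} \, \min_{t \ge 0} \, \ell(z^{AI}_t),
\qquad
\Omega^{*} = \{\, z : V(z) \ge 0 \,\}
$$

$$
V(z) = \min\Big\{ \ell(z), \; \max_{a^{AI}} \, \min_{a^{H}} \, V\big(f(z, a^{AI}, a^{H})\big) \Big\}
$$

Under these assumptions, $V(z^{AI}_0) \ge 0$ means the AI has some strategy that keeps the interaction out of $\mathcal{F}$ against every human behavior the model considers, and the second line is the fixed-point (Isaacs-type) condition that such a value function satisfies.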

Abstract

Artificial intelligence (AI) is interacting with people at an unprecedented scale, offering new avenues for immense positive impact, but also raising widespread concerns around the potential for individual and societal harm. Today, the predominant paradigm for human–AI safety focuses on fine-tuning the generative model's outputs to better agree with human-provided examples or feedback. In reality, however, the consequences of an AI model's outputs cannot be determined in isolation: they are tightly entangled with the responses and behavior of human users over time. In this paper, we distill key complementary lessons from AI safety and control systems safety, highlighting open challenges as well as key synergies between both fields. We then argue that meaningful safety assurances for advanced AI technologies require reasoning about how the feedback loop formed by AI outputs and human behavior may drive the interaction towards different outcomes. To this end, we introduce a unifying formalism to capture dynamic, safety-critical human–AI interactions and propose a concrete technical roadmap towards next-generation human-centered AI safety.

Paper Structure

This paper contains 11 sections, 1 theorem, 5 equations, 5 figures.

Key Result

Theorem 1

Consider a human–AI system with AI world model $f^\texttt{AI}(z^\texttt{AI},a^\texttt{AI},a^\texttt{H},o^{\texttt{AI}})$ and a safety filter $({\pi^\texttt{AI}_\text{\tiny{*}}},{\Delta},{\phi})$. If the AI agent is deployed with an initial internal state ${z^\texttt{AI}_0 \in \mathcal{Z}^\texttt{AI}}$ …

Figures (5)

  • Figure 1: We identify a high-value window of opportunity to combine the growing capabilities of generative AI with the robust, interaction-aware dynamical safety frameworks from control theory. This synergy can unlock a new generation of human–AI safety mechanisms that can perform systematic risk mitigation at scale.
  • Figure 2: Examples of safety in embodied human–automation systems vs. human–AI dialogue.
  • Figure 3: Common sense failure identification via GPT-4. Today's web-trained generative AI models show the potential to identify common sense safety hazards from both text and images.
  • Figure 4: (left) The AI always acts under the safety-critical game policy (${\pi^\texttt{AI}_\text{\tiny{*}}}, \pi^\texttt{H}_\dag$), making it safe but conservative. (right) The filtered AI follows its task policy ${\pi^\texttt{AI}_{\checkmark}}$ as long as it can still apply ${\pi^\texttt{AI}_\text{\tiny{*}}}$ against $\pi^\texttt{H}_\dag$ in the future.
  • Figure 5: Human–AI Safety Filter. The base AI model encodes the AI's observations into its latent state $z^\texttt{AI}$, which is used as input for its task policy (${\pi^\texttt{AI}_{\checkmark}}$). The safety filter comprises a learned AI safety strategy ${\pi^\texttt{AI}_\text{\tiny{*}}}$, a safety monitor ${\Delta}$ that predicts safety risks, and a predictive human model containing a virtual adversary $\pi^\texttt{H}_\dag$ that generates pessimistic predictions of human interaction. Based on ${\Delta}$, the intervention scheme ${\phi}$ filters the AI's outputs to the human and modifies them to guarantee safety.
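The filter components named in the Figure 4 and Figure 5 captions (fallback safety strategy, safety monitor, virtual adversary, intervention scheme) can be sketched on a toy problem. The following is a minimal illustration and not the paper's implementation: the six-level "risk" state, the action sets, the dynamics in `step`, the margin function, and the switch-to-fallback intervention are all hypothetical choices made here so that the example stays small and runnable. The safety value is computed by tabular value iteration rather than learned in a latent space.

```python
# Toy human-AI safety filter on a 1-D "risk level" state (all modelling
# choices below are hypothetical and exist only for illustration).

STATES = range(6)          # toy internal state z in {0, ..., 5}
AI_ACTIONS = (-1, 0, 1)    # de-escalate, neutral, escalate
FAIL = 5                   # z == 5 is the failure set


def human_actions(z):
    # Assumption: the human can push risk harder once the interaction is tense.
    return (0, 1) if z < 3 else (0, 1, 2)


def step(z, a, h):
    # Assumed dynamics: risk moves by the sum of both actions, clipped to range.
    return max(0, min(FAIL, z + a + h))


def margin(z):
    # Safety margin: negative exactly on the failure set.
    return FAIL - 1 - z


def worst_case(V, z, a):
    # Value of AI action a if the human responds as badly as possible.
    return min(V[step(z, a, h)] for h in human_actions(z))


def solve_safety_value(max_iters=50):
    # Fixed-point iteration on V(z) = min(margin(z), max_a min_h V(z')).
    V = [margin(z) for z in STATES]
    for _ in range(max_iters):
        V_new = [
            min(margin(z), max(worst_case(V, z, a) for a in AI_ACTIONS))
            for z in STATES
        ]
        if V_new == V:
            break
        V = V_new
    return V


def fallback_policy(V, z):
    # Safety strategy: the AI action with the best worst-case value.
    return max(AI_ACTIONS, key=lambda a: worst_case(V, z, a))


def adversary(V, z, a):
    # Virtual adversary: the human response that minimizes the safety value.
    return min(human_actions(z), key=lambda h: V[step(z, a, h)])


def safety_filter(V, z, proposed):
    # Monitor: accept the task action only if the worst-case next state
    # stays in the safe set {V >= 0}. Intervention: otherwise use the fallback.
    if worst_case(V, z, proposed) >= 0:
        return proposed
    return fallback_policy(V, z)


def rollout(V, z0, task_policy, filtered, steps=8):
    traj, z = [z0], z0
    for _ in range(steps):
        a = task_policy(z)
        if filtered:
            a = safety_filter(V, z, a)
        z = step(z, a, adversary(V, z, a))
        traj.append(z)
    return traj


def task(z):
    # Hypothetical task policy that always escalates.
    return 1


V = solve_safety_value()
safe_set = [z for z in STATES if V[z] >= 0]
filtered_traj = rollout(V, 0, task, filtered=True)
unfiltered_traj = rollout(V, 0, task, filtered=False)
print("V =", V)
print("safe set =", safe_set)
print("filtered   :", filtered_traj)
print("unfiltered :", unfiltered_traj)
```

In this toy model the safe set is strictly smaller than the set of non-failure states: levels 3 and 4 are not failures, but the adversary can force failure from them, so the filter overrides the task action before the interaction gets there. Switching to the fallback action is only one possible intervention scheme; a less conservative variant would pick the safe action closest to the proposed one.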

Theorems & Definitions (1)

  • Theorem 1: General Human–AI Safety Filter