What is Harm? Baby Don't Hurt Me! On the Impossibility of Complete Harm Specification in AI Alignment

Robin Young

TL;DR

This work argues that complete specification of harm for AI is fundamentally impossible because of an information-theoretic gap between the entropy of ground-truth harm, $H(O)$, and the mutual information $I(O;I)$ between ground-truth harm $O$ and a system's specifications $I$. It introduces semantic entropy $H(S)$ and the safety-capability ratio $I(O;I)/H(O)$, provides a four-level formalization culminating in the central inequality $I(O;I) < H(O)$, and proposes a practical shift toward uncertainty-aware architectures and governance. The paper makes three key contributions: (i) a rigorous framework for quantifying specification limits, (ii) a falsifiable metric to assess alignment progress, and (iii) concrete examples (e.g., medical triage, autonomous systems) showing how irreducible semantic entropy manifests in real-world settings. Together, these results motivate reorienting AI alignment from pursuing complete normative specifications to building systems that operate safely despite persistent specification uncertainty.

Abstract

"First, do no harm" faces a fundamental challenge in artificial intelligence: how can we specify what constitutes harm? While prior work treats harm specification as a technical hurdle to be overcome through better algorithms or more data, we argue this assumption is unsound. Drawing on information theory, we demonstrate that complete harm specification is fundamentally impossible for any system where harm is defined external to its specifications. This impossibility arises from an inescapable information-theoretic gap: the entropy of harm $H(O)$ always exceeds the mutual information $I(O;I)$ between ground-truth harm $O$ and a system's specifications $I$. We introduce two novel metrics, semantic entropy $H(S)$ and the safety-capability ratio $I(O;I)/H(O)$, to quantify these limitations. Through a progression of increasingly sophisticated specification attempts, we show why each approach must fail and why the resulting gaps are not mere engineering challenges but fundamental constraints akin to the halting problem. These results suggest a paradigm shift: rather than pursuing complete specifications, AI alignment research should focus on developing systems that can operate safely despite irreducible specification uncertainty.

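To make the quantities in the abstract concrete, the following is a minimal sketch that computes $H(O)$, $I(O;I)$, and the safety-capability ratio $I(O;I)/H(O)$ for a small discrete joint distribution. The function names, the two-outcome setup, and the probabilities in `toy_joint` are all hypothetical and chosen only for illustration; they are not taken from the paper. The sketch shows the standard bound $I(O;I) \le H(O)$ on one example and does not reproduce the paper's argument for a strict gap when harm is defined external to the specifications.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(O;I) in bits, where joint[o][i] = P(O=o, I=i)."""
    p_o = [sum(row) for row in joint]        # marginal over ground-truth harm O
    p_i = [sum(col) for col in zip(*joint)]  # marginal over specification I
    mi = 0.0
    for o, row in enumerate(joint):
        for i, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_o[o] * p_i[i]))
    return mi


def safety_capability_ratio(joint):
    """I(O;I) / H(O): the fraction of harm entropy the specification captures."""
    p_o = [sum(row) for row in joint]
    return mutual_information(joint) / entropy(p_o)


# Hypothetical toy numbers (an assumption, not data from the paper).
# Rows: O = harmless / harmful. Columns: I = spec says safe / spec says unsafe.
toy_joint = [
    [0.45, 0.05],
    [0.10, 0.40],
]

# A specification that mirrors ground truth exactly, for comparison.
perfect_joint = [
    [0.5, 0.0],
    [0.0, 0.5],
]

toy_ratio = safety_capability_ratio(toy_joint)
perfect_ratio = safety_capability_ratio(perfect_joint)
print(f"toy ratio: {toy_ratio:.3f}, perfect ratio: {perfect_ratio:.3f}")
```

In this toy setting the imperfect specification yields a ratio strictly between 0 and 1, while a specification that is a deterministic copy of $O$ reaches 1. The abstract's claim is that the second case is unattainable whenever harm is defined outside the system's specifications.
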