AI Safety, Alignment, and Ethics (AI SAE)
Dylan Waldner
TL;DR
The paper addresses how to sustain human–AI cooperation as AI capabilities scale, arguing that ethics must be structurally embedded within AI systems. It introduces the moral problem space $\mathcal{M}$ and the instantiated moral representations $M(\theta)$, and proposes a governance-embedding-representation pipeline that ties normative learning to system architecture and multi-level oversight. A formal framework links single-system ML objects to population dynamics via replicator theory, outlining how sanctions and subsidies can steer ecosystems toward an aligned, competitive, and symbiotic regime $\mathcal{A}_{\text{ACS}}$. The work surveys metaethical hypotheses (realism, relativism, convergence, and virtue) and lays out methods, hypotheses, and future research directions for operationalizing $M(\theta)$, including Pigouvian governance as a practical mechanism. Overall, the paper presents a program for embedding normative structure within AI representations and governance so that alignment resists failure under evolution and competition, while acknowledging open challenges such as inner alignment, deceptive alignment, and cross-cultural variation.
Abstract
This paper grounds ethics in evolutionary biology, viewing moral norms as adaptive mechanisms that render cooperation fitness-viable under selection pressure. Current alignment approaches add ethics post hoc, treating it as an external constraint rather than embedding it as an evolutionary strategy for cooperation. The central question is whether normative architectures can be embedded directly into AI systems to sustain human–AI cooperation (symbiosis) as capabilities scale. To address this, I propose a governance-embedding-representation pipeline linking moral representation learning to system-level design and institutional governance, treating alignment as a multi-level problem spanning cognition, optimization, and oversight. I formalize moral norm representation through the moral problem space, a learnable subspace in neural representations where cooperative norms can be encoded and causally manipulated. Using sparse autoencoders, activation steering, and causal interventions, I outline a research program for engineering moral representations and embedding them into the full semantic space, treating competing theories of morality as empirical hypotheses about representation geometry rather than philosophical positions. Governance principles leverage these learned moral representations to regulate how cooperative behaviors evolve within the AI ecosystem. Through replicator dynamics and multi-agent game theory, I model how internal representational features can shape population-level incentives, motivating the design of sanctions and subsidies structured to yield decentralized normative institutions.
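As a rough illustration of the replicator-dynamics argument, the sketch below simulates a two-strategy population (cooperators and defectors) and shows how a fixed subsidy and sanction can change which strategy spreads. It is not the paper's model: the prisoner's-dilemma payoffs, the transfer sizes, the Euler integration, and the names `replicator_step` and `simulate` are all assumptions chosen for illustration, and the regime $\mathcal{A}_{\text{ACS}}$ is not represented.

```python
def replicator_step(x, payoff, transfer, dt=0.01):
    """One Euler step of two-strategy replicator dynamics.

    x        -- current share of cooperators in the population (0..1)
    payoff   -- 2x2 nested list; row = own strategy (0 cooperate, 1 defect),
                column = opponent strategy
    transfer -- pair of per-strategy transfers added to fitness
                (positive = subsidy, negative = sanction); hypothetical values
    """
    f_coop = payoff[0][0] * x + payoff[0][1] * (1.0 - x) + transfer[0]
    f_defect = payoff[1][0] * x + payoff[1][1] * (1.0 - x) + transfer[1]
    mean_fitness = x * f_coop + (1.0 - x) * f_defect
    # Replicator equation: a strategy grows when it beats the population mean.
    dx = x * (f_coop - mean_fitness)
    return min(1.0, max(0.0, x + dt * dx))


def simulate(x0, payoff, transfer, steps=5000, dt=0.01):
    """Iterate the replicator step and return the final cooperator share."""
    x = x0
    for _ in range(steps):
        x = replicator_step(x, payoff, transfer, dt)
    return x


# Illustrative prisoner's dilemma payoffs (assumed, not from the paper).
PD = [[3.0, 0.0],
      [5.0, 1.0]]

no_policy = (0.0, 0.0)   # no governance transfers
policy = (1.0, -2.0)     # assumed subsidy for cooperators, sanction on defectors

x_free = simulate(0.5, PD, no_policy)
x_governed = simulate(0.5, PD, policy)
print(f"cooperator share without transfers: {x_free:.4f}")
print(f"cooperator share with transfers:    {x_governed:.4f}")
```

With these assumed numbers, defection dominates when there are no transfers, so the cooperator share decays toward zero. The transfers make cooperation the higher-fitness strategy at every population mix, so the share rises toward one. In the paper's framing, the transfers would be conditioned on learned moral representations rather than fixed by hand as they are here.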
