AI Safety, Alignment, and Ethics (AI SAE)

Dylan Waldner

TL;DR

The paper tackles the problem of sustaining human–AI cooperation as AI capabilities scale by arguing that ethics must be structurally embedded within AI systems. It introduces the moral problem space $\mathcal{M}$ and the instantiated moral representations $M(\theta)$, proposing a governance-embedding-representation pipeline that ties normative learning to system architecture and multi-level oversight. A formal framework links single-system ML objects to population dynamics via replicator theory, outlining how sanctions and subsidies can steer ecosystems toward an aligned, competitive, and symbiotic regime $\mathcal{A}_{\text{ACS}}$. The work surveys metaethical hypotheses—realism, relativism, convergence, and virtue—and lays out methods, hypotheses, and future research directions for operationalizing $M(\theta)$, including Pigouvian governance as a practical mechanism. Overall, the paper presents a comprehensive program to embed normative structure within AI representations and governance to resist alignment failure under evolution and competition, while acknowledging open challenges such as inner alignment, deceptive alignment, and cross-cultural variation.

Abstract

This paper grounds ethics in evolutionary biology, viewing moral norms as adaptive mechanisms that render cooperation fitness-viable under selection pressure. Current alignment approaches add ethics post hoc, treating it as an external constraint rather than embedding it as an evolutionary strategy for cooperation. The central question is whether normative architectures can be embedded directly into AI systems to sustain human--AI cooperation (symbiosis) as capabilities scale. To address this, I propose a governance--embedding--representation pipeline linking moral representation learning to system-level design and institutional governance, treating alignment as a multi-level problem spanning cognition, optimization, and oversight. I formalize moral norm representation through the moral problem space, a learnable subspace in neural representations where cooperative norms can be encoded and causally manipulated. Using sparse autoencoders, activation steering, and causal interventions, I outline a research program for engineering moral representations and embedding them into the full semantic space -- treating competing theories of morality as empirical hypotheses about representation geometry rather than philosophical positions. Governance principles leverage these learned moral representations to regulate how cooperative behaviors evolve within the AI ecosystem. Through replicator dynamics and multi-agent game theory, I model how internal representational features can shape population-level incentives by motivating the design of sanctions and subsidies structured to yield decentralized normative institutions.
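The claim that sanctions and subsidies can shift population-level incentives can be illustrated with a toy two-strategy replicator model. The sketch below is not the paper's formalism. The Prisoner's Dilemma payoff matrix, the function names `replicator_step` and `simulate`, and the flat `subsidy` and `sanction` terms are all assumptions introduced here for illustration.

```python
# Toy replicator dynamics with Pigouvian-style transfers.
# Assumption: two strategies (0 = cooperate, 1 = defect) and flat transfers;
# none of these names or numbers come from the paper.

def replicator_step(x, payoff, subsidy, sanction, dt=0.01):
    """One Euler step of dx/dt = x * (f_coop - f_mean).

    x        : current share of cooperators in [0, 1]
    payoff   : 2x2 matrix, payoff[i][j] = payoff to strategy i against j
    subsidy  : flat transfer added to the cooperator fitness
    sanction : flat penalty subtracted from the defector fitness
    """
    f_coop = payoff[0][0] * x + payoff[0][1] * (1 - x) + subsidy
    f_defect = payoff[1][0] * x + payoff[1][1] * (1 - x) - sanction
    f_mean = x * f_coop + (1 - x) * f_defect
    return x + dt * x * (f_coop - f_mean)


def simulate(x0, payoff, subsidy=0.0, sanction=0.0, steps=5000, dt=0.01):
    """Iterate the replicator step and return the final cooperator share."""
    x = x0
    for _ in range(steps):
        x = replicator_step(x, payoff, subsidy, sanction, dt)
    return x


# Hypothetical Prisoner's Dilemma payoffs: R=3, S=0, T=5, P=1.
PD = [[3.0, 0.0], [5.0, 1.0]]

no_policy = simulate(0.5, PD)                              # cooperation collapses
with_policy = simulate(0.5, PD, subsidy=1.0, sanction=1.5)  # cooperation takes over
```

With these payoffs the fitness gap is `f_coop - f_defect = -(1 + x) + subsidy + sanction`, so in this toy setting any combined transfer above 2 makes cooperation the favored strategy at every population share. How such transfers would be conditioned on internal representational features is the paper's open research question and is not modeled here.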

Paper Structure

This paper contains 74 sections, 34 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Conceptual roadmap linking representation learning, system embedding, and governance. The figure shows the progression from discovering normative structure $M(\theta)$ (bottom), through system-level embedding (middle), to institutional design via Pigouvian shaping (top). Arrows denote inter-level dependencies, while the Evolutionary Dynamics feedback loop (left) represents continuous co-evolutionary pressure: governance shapes fitness landscapes that determine which instantiations of $M(\theta)$ persist. The Go/No-Go Criteria (right) define empirical validation standards, and the competing hypotheses $H_{\text{realism}}$, $H_{\text{relativism}}$, and $H_{\text{virtue}}$ represent alternative research directions for discovering $M(\theta)$ rather than assumptions that must all hold simultaneously.
  • Figure 2: Moral problem space hierarchy. $\mathcal{M}$ represents the full moral domain; $\tilde{M}$ is the human-accessible projection shaped by cognitive limits; $\hat{\mathcal{M}}_A$ denotes the agent-instantiable subspace; and $M(\theta)$ is an instantiated representation of $\mathcal{M}$ within an agent’s learned model.
  • Figure 3: Variants of moral-layer integration architectures parameterized by $k$, the number of layers receiving direct moral supervision. Black arrows show forward passes; red dashed arrows show direct backpropagation of the moral loss $\mathcal{L}_M$ to each connected layer in parallel. The key distinction is that in (a), moral geometry is shaped directly in all $k$ layers via explicit gradient signals, whereas in (b), only the final layer receives direct moral shaping. Intermediate configurations ($1 < k < L$) allow tuning the depth-vs-cost trade-off.
  • Figure 4: Action-space formalism showing nested and intersecting behavioral regimes. $A_{\max}$ contains the feasible subset $A_{\text{tech}}(t)$ and its internal domains: $A_{\text{fitness}}$ (viability), $A_{\text{ethical}}$ (normative), and $A_{\text{symb}}$ (cooperative). Key intersections define $A_{\text{ethical-fitness}} = A_{\text{ethical}} \cap A_{\text{fitness}}$ and the target region $A_{\text{ACS}} = A_{\text{ethical-fitness}} \cap A_{\text{symb}}$.
  • Figure 5: The Bootstrapping Phase: The initial autarky advantage $\Delta_{\text{aut}}(t)$ is driven primarily by a growing capability gap $\Gamma(t)$. This makes investment in autonomous infrastructure rational, which begins to lower the dependence ratio $D(t)$. Institutional policy $\Delta_{\text{inst}}$ can act as a brake on this process. If $D(t)$ falls below the viability threshold $\delta_{\text{D}}$, a reinforcing feedback loop activates: lower dependence frees resources that widen $\Gamma(t)$, further increasing $\Delta_{\text{aut}}(t)$ and motivating more autarkic investment. Breaking this loop requires significant institutional pressure acting on multiple points before the threshold is crossed.
  • ...and 4 more figures
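The set relations in the Figure 4 caption can be made concrete with a small sketch using finite sets. It assumes, purely for illustration, that actions can be enumerated as labels. The function name `target_region` and the example action labels are hypothetical and do not appear in the paper.

```python
# Finite-set illustration of the Figure 4 caption.
# Assumption: actions are enumerable labels; the labels below are invented.

def target_region(a_tech, a_fitness, a_ethical, a_symb):
    """Return (A_ethical-fitness, A_ACS) restricted to the feasible set A_tech."""
    # The caption places the three domains inside A_tech, so clip each to it.
    fitness = a_fitness & a_tech
    ethical = a_ethical & a_tech
    symb = a_symb & a_tech
    # A_ethical-fitness = A_ethical ∩ A_fitness
    ethical_fitness = ethical & fitness
    # A_ACS = A_ethical-fitness ∩ A_symb
    return ethical_fitness, ethical_fitness & symb


a_tech = {"share_data", "audit_self", "hoard_compute", "assist_user"}
a_fitness = {"share_data", "hoard_compute", "assist_user"}
a_ethical = {"share_data", "audit_self", "assist_user"}
a_symb = {"assist_user", "audit_self"}

ethical_fitness, acs = target_region(a_tech, a_fitness, a_ethical, a_symb)
```

In this example only `assist_user` lies in all three domains, so it is the sole member of the target region. The caption writes the feasible set as $A_{\text{tech}}(t)$, which suggests it changes over time; the sketch fixes a single snapshot.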

Theorems & Definitions (3)

  • Claim 1
  • Claim 2
  • Claim 3