What Is the Alignment Tax?

Robin Young

Abstract

The alignment tax is widely discussed but has not been formally characterized. We provide a geometric theory of the alignment tax in representation space. Under linear representation assumptions, we define the alignment tax rate as the squared projection of the safety direction onto the capability subspace and derive the Pareto frontier governing safety-capability tradeoffs, parameterized by a single quantity: the principal angle between the safety and capability subspaces. We prove this frontier is tight and show it has a recursive structure: safety-safety tradeoffs under capability constraints are governed by the same equation, with the angle replaced by the partial correlation between safety objectives given capability directions. We derive a scaling law decomposing the alignment tax into an irreducible component determined by data structure and a packing residual that vanishes as $O(m'/d)$ with model dimension $d$, and establish conditions under which capability preservation mediates or resolves conflicts between safety objectives.

Paper Structure

This paper contains 24 sections, 14 theorems, 22 equations.

Key Result

Theorem 5

Let $v^* \in \mathbb{S}^{d-1}$ be the safety direction and $c \in \mathbb{S}^{d-1}$ be a capability direction with angle $\alpha = \arccos(\langle v^*, c \rangle)$ between them. The Pareto frontier, i.e., the maximum achievable safety gain $\Delta_S$ for a given capability change $\Delta_C$ subject to [...] This frontier is tight: for each $\Delta_C \in [-B, B]$, there exists $\delta^*$ with $\| \delta^* \ldots$
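
A minimal numerical sketch of a frontier of this kind, under assumptions that the text above does not confirm: the safety gain is taken to be linear, $\Delta_S = \langle v^*, \delta \rangle$, the capability change is $\Delta_C = \langle c, \delta \rangle$, and the budget constraint is $\|\delta\| \le B$. Under exactly these assumptions, Cauchy-Schwarz gives $\Delta_S \le \Delta_C \cos\alpha + \sqrt{B^2 - \Delta_C^2}\,\sin\alpha$, and the perturbation built below attains it. The function name `frontier_safety_gain` and its parameters are hypothetical, and the paper's own statement may differ in form.

```python
import numpy as np

def frontier_safety_gain(v_star, c, delta_c, budget):
    """Maximize <v_star, delta> subject to <c, delta> = delta_c, ||delta|| <= budget.

    Assumes unit vectors v_star and c and the linear model described in the
    lead-in (an assumption, not the paper's confirmed setup).
    Returns (max safety gain, a perturbation that attains it).
    """
    if abs(delta_c) > budget:
        raise ValueError("delta_c must lie in [-budget, budget]")
    cos_a = float(np.dot(v_star, c))
    # Part of v_star orthogonal to c; its norm equals sin(alpha).
    resid = v_star - cos_a * c
    sin_a = float(np.linalg.norm(resid))
    # Budget left over after spending delta_c along c.
    slack = float(np.sqrt(budget**2 - delta_c**2))
    # Spend the remaining budget along the orthogonal part of v_star.
    u = resid / sin_a if sin_a > 1e-12 else np.zeros_like(c)
    delta = delta_c * c + slack * u
    return delta_c * cos_a + slack * sin_a, delta

# Usage: two directions 60 degrees apart, budget B = 1, no capability change.
v = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3), 0.0])
c = np.array([1.0, 0.0, 0.0])
gain, delta = frontier_safety_gain(v, c, delta_c=0.0, budget=1.0)
print(gain, delta)
```

With $\Delta_C = 0$ the sketch spends the whole budget orthogonally to $c$, so the attainable gain shrinks from $B$ to $B \sin\alpha$ as the two directions align.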

Theorems & Definitions (34)

  • Definition 1: Safety direction
  • Definition 2: Capability directions
  • Definition 3: Perturbation budget
  • Definition 4: Alignment tax rate
  • Theorem 5: Single-capability Pareto frontier
  • proof
  • Remark 1: Limiting cases
  • Theorem 6: Maximum safety gain under capability constraint
  • proof
  • Corollary 7: Tax-free safety
  • ...and 24 more
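
The abstract defines the alignment tax rate (Definition 4 in the list above) as the squared projection of the safety direction onto the capability subspace. The sketch below computes that quantity with NumPy. The function name `alignment_tax_rate`, the column-wise layout of the capability directions, and the assumption that those directions are linearly independent are illustrative choices, not taken from the paper.

```python
import numpy as np

def alignment_tax_rate(v_star, capability_dirs):
    """Squared norm of the projection of v_star onto span(capability_dirs).

    v_star: shape (d,), the safety direction (normalized here).
    capability_dirs: shape (d, k), one capability direction per column;
        assumed linearly independent so that QR yields a basis of the span.
    """
    v = v_star / np.linalg.norm(v_star)
    # Reduced QR gives an orthonormal basis (columns of q) for the subspace.
    q, _ = np.linalg.qr(capability_dirs)
    # Squared length of the projection = sum of squared basis coefficients.
    return float(np.sum((q.T @ v) ** 2))

# Usage: safety direction at 60 degrees to a single capability direction.
v = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3), 0.0])
caps = np.array([[1.0], [0.0], [0.0]])
print(alignment_tax_rate(v, caps))
```

For a single capability direction the value reduces to $\cos^2\alpha$, where $\alpha$ is the angle between the two directions. It is 0 when the safety direction is orthogonal to every capability direction and 1 when it lies inside their span.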