What Is the Alignment Tax?

Robin Young

Abstract

The alignment tax is widely discussed but has not been formally characterized. We provide a geometric theory of the alignment tax in representation space. Under linear representation assumptions, we define the alignment tax rate as the squared projection of the safety direction onto the capability subspace and derive the Pareto frontier governing safety-capability tradeoffs, parameterized by a single quantity: the principal angle between the safety and capability subspaces. We prove this frontier is tight and show it has a recursive structure: safety-safety tradeoffs under capability constraints are governed by the same equation, with the angle replaced by the partial correlation between safety objectives given capability directions. We derive a scaling law decomposing the alignment tax into an irreducible component determined by data structure and a packing residual that vanishes as $O(m'/d)$ with model dimension $d$, and establish conditions under which capability preservation mediates or resolves conflicts between safety objectives.
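To make the abstract's definition concrete, the following is a minimal sketch of the tax rate as the squared norm of the projection of a unit safety direction onto the span of the capability directions. The function names (`tax_rate`, `orthonormal_basis`) and the example vectors are illustrative assumptions, not taken from the paper; the sketch also assumes the capability subspace is simply the linear span of the given directions.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already in the span
            basis.append([wi / norm for wi in w])
    return basis


def tax_rate(safety_dir, capability_dirs):
    """Squared projection of the normalized safety direction onto the
    capability subspace (hypothetical helper; assumes the subspace is the
    span of `capability_dirs`)."""
    norm = math.sqrt(dot(safety_dir, safety_dir))
    v = [x / norm for x in safety_dir]
    basis = orthonormal_basis(capability_dirs)
    return sum(dot(v, q) ** 2 for q in basis)


# Single capability direction at 60 degrees to the safety direction:
# the tax rate reduces to cos^2 of the angle between them.
v_star = [1.0, 0.0, 0.0]
c = [math.cos(math.pi / 3), math.sin(math.pi / 3), 0.0]
rate = tax_rate(v_star, [c])
```

With a single capability direction the value equals $\cos^2\alpha$, so orthogonal directions give a rate of 0 and a safety direction lying inside the capability subspace gives a rate of 1.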

Paper Structure (24 sections, 14 theorems, 22 equations)

This paper contains 24 sections, 14 theorems, 22 equations.

Introduction
Related Work
Setup and Definitions
Representation Space
The Pareto Frontier
Single-Capability Frontier
Maximum Safety at Fixed Capability
Tax Rate Properties
Anisotropic Budget Extension
Scaling Law for the Alignment Tax
Feature Packing Model
Scaling Theorem
Multi-Objective Safety and the Conflict Theorem
Safety-Safety Frontier
When Does Capability Preservation Help or Hurt?
...and 9 more sections

Key Result

Theorem 5

Let $v^* \in \mathbb{S}^{d-1}$ be the safety direction and $c \in \mathbb{S}^{d-1}$ be a capability direction with angle $\alpha = \arccos(\langle v^*, c \rangle)$ between them. The Pareto frontier, that is, the maximum achievable safety gain $\Delta_S$ for a given capability change $\Delta_C$ subject to ... This frontier is tight: for each $\Delta_C \in [-B, B]$, there exists $\delta^*$ with $\| \delta^*$ ...
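The theorem statement is cut off in this summary, so the sketch below is only a hedged illustration of what such a frontier can look like. It assumes, without confirmation from the excerpt, that the safety gain is $\Delta_S = \langle v^*, \delta \rangle$, the capability change is $\Delta_C = \langle c, \delta \rangle$, and the budget constraint is $\|\delta\| \le B$. Under those assumptions, maximizing $\Delta_S$ at fixed $\Delta_C$ gives the closed form $\Delta_S = \Delta_C \cos\alpha + \sqrt{B^2 - \Delta_C^2}\,\sin\alpha$. The code works in the plane spanned by $c$ and $v^*$, since components of $\delta$ orthogonal to both only consume budget; all function names are hypothetical.

```python
import math
import random


def frontier(delta_c, alpha, budget):
    """Assumed closed-form frontier: max safety gain at capability change delta_c."""
    slack = max(0.0, budget ** 2 - delta_c ** 2)
    return delta_c * math.cos(alpha) + math.sqrt(slack) * math.sin(alpha)


def safety_gain(delta, alpha):
    """<v*, delta> in coordinates where c = (1, 0) and v* = (cos a, sin a)."""
    return delta[0] * math.cos(alpha) + delta[1] * math.sin(alpha)


def optimal_perturbation(delta_c, budget):
    """Perturbation attaining the frontier: spend all remaining budget
    on the component of v* orthogonal to c."""
    slack = max(0.0, budget ** 2 - delta_c ** 2)
    return (delta_c, math.sqrt(slack))


def max_violation(alpha, budget, trials=2000, seed=0):
    """Largest amount by which a sampled feasible perturbation exceeds the
    frontier (should be <= 0 up to rounding if the closed form is right)."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        delta_c = rng.uniform(-budget, budget)
        radius = math.sqrt(max(0.0, budget ** 2 - delta_c ** 2))
        t = rng.uniform(-radius, radius)  # keeps the norm within budget
        gap = safety_gain((delta_c, t), alpha) - frontier(delta_c, alpha, budget)
        worst = max(worst, gap)
    return worst


alpha, budget, delta_c = math.pi / 3, 1.0, 0.2
delta_star = optimal_perturbation(delta_c, budget)
attained = safety_gain(delta_star, alpha)
```

In this sketch `delta_star` has norm exactly `budget` and attains `frontier(delta_c, alpha, budget)`, which is the sense in which a frontier of this form would be tight. At $\alpha = \pi/2$ the capability-neutral point $\Delta_C = 0$ yields the full safety gain $B$, and at $\alpha = 0$ safety and capability move together.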

Theorems & Definitions (34)

Definition 1: Safety direction
Definition 2: Capability directions
Definition 3: Perturbation budget
Definition 4: Alignment tax rate
Theorem 5: Single-capability Pareto frontier
proof
Remark 1: Limiting cases
Theorem 6: Maximum safety gain under capability constraint
proof
Corollary 7: Tax-free safety
...and 24 more
