Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization
Antoine Maier, Aude Maier, Tom David
TL;DR
The paper argues that the common Objective Satisfaction Assumption (OSA), namely that training yields models that truly optimize the specified objective, fails in realistic settings because of approximation, estimation, and optimization errors, as well as unavoidable misspecification of human intent. Using a learning-paradigm-agnostic framework, it decomposes excess risk into three nonzero components and shows how finite capacity, finite data, and imperfect optimization guarantee deviations from the intended objective; it also highlights performativity and outer versus inner misalignment as entrenched challenges. By connecting these gaps to Goodhart's law, the authors show that under strong optimization pressure proxy targets can diverge from true goals, potentially causing loss of control in General-Purpose AI (GPAI) systems. Because the Goodhart breaking point cannot be located in advance, the work calls for principled limits on GPAI optimization, such as halting criteria or capacity constraints, as proactive safeguards for safety and robustness.
Abstract
A common but rarely examined assumption in machine learning is that training yields models that actually satisfy their specified objective function. We call this the Objective Satisfaction Assumption (OSA). Although deviations from OSA are acknowledged, their implications are overlooked. We argue, in a learning-paradigm-agnostic framework, that OSA fails in realistic conditions: approximation, estimation, and optimization errors guarantee systematic deviations from the intended objective, regardless of the quality of its specification. Beyond these technical limitations, perfectly capturing and translating the developer's intent, such as alignment with human preferences, into a formal objective is practically impossible, making misspecification inevitable. Building on recent mathematical results, we argue that, absent a mathematical characterization of these gaps, they are indistinguishable from those that collapse into Goodhart's law failure modes under strong optimization pressure. Because the Goodhart breaking point cannot be located ex ante, a principled limit on the optimization of General-Purpose AI systems is necessary. Absent such a limit, continued optimization is liable to push systems into predictable and irreversible loss of control.
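
For orientation, the three error sources named above correspond to the standard excess-risk decomposition from statistical learning theory. The sketch below is an illustration, not the paper's own formulation, and all notation is assumed here: $R$ is the true risk, $f^{*}$ the risk minimizer over all functions, $f^{*}_{\mathcal{F}}$ the best model in the chosen class $\mathcal{F}$, $f_{n}$ the empirical risk minimizer on $n$ samples, and $\tilde{f}_{n}$ the model the optimizer actually returns.

```latex
% Telescoping identity; the symbols are illustrative assumptions, not the paper's notation.
\[
R(\tilde{f}_{n}) - R(f^{*})
  = \underbrace{R(f^{*}_{\mathcal{F}}) - R(f^{*})}_{\text{approximation error (finite capacity)}}
  + \underbrace{R(f_{n}) - R(f^{*}_{\mathcal{F}})}_{\text{estimation error (finite data)}}
  + \underbrace{R(\tilde{f}_{n}) - R(f_{n})}_{\text{optimization error (imperfect optimizer)}}
\]
```

Read this way, OSA amounts to assuming the left-hand side is zero, which would require all three terms on the right to vanish or to cancel exactly.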
