
On the Undecidability of Artificial Intelligence Alignment: Machines that Halt

Gabriel Adriano de Melo, Marcos Ricardo Omena De Albuquerque Maximo, Nei Yoshihiro Soma, Paulo Andre Lima de Castro

TL;DR

This work argues that alignment should be a property guaranteed by the AI architecture rather than a characteristic imposed post hoc on an arbitrary AI model, and proposes that the judge function that defines alignment must also impose a halting constraint, guaranteeing that the AI model always reaches a terminal state in a finite number of execution steps.

Abstract

The inner alignment problem, which asks whether an arbitrary artificial intelligence (AI) model satisfies a non-trivial alignment function of its outputs given its inputs, is undecidable. This follows rigorously from Rice's theorem, and can equivalently be shown by a reduction from Turing's Halting Problem; a proof sketch is presented in this work. Nevertheless, there is an enumerable set of provably aligned AIs that are constructed from a finite set of provably aligned operations. Therefore, we argue that alignment should be a property guaranteed by the AI architecture rather than a characteristic imposed post hoc on an arbitrary AI model. Furthermore, while the outer alignment problem is that of defining a judge function that captures human values and preferences, we propose that such a function must also impose a halting constraint that guarantees that the AI model always reaches a terminal state in a finite number of execution steps. Our work presents examples and models that illustrate this constraint and the intricate challenges involved, advancing a compelling case for adopting an intrinsically hard-aligned approach to AI system architectures that ensures halting.
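The diagonal argument behind the undecidability claim can be illustrated with a small Python sketch. It is a toy illustration under stated assumptions, not the paper's construction: `judge`, `make_adversary`, and `decider_is_wrong` are hypothetical names, and the judge is an assumed rule that rejects only the literal output `"HARM"`. Given any total function that claims to decide alignment, the sketch builds a model that consults that function about itself and then does the opposite.

```python
from typing import Callable

Model = Callable[[str], str]
Decider = Callable[[Model], bool]


def judge(i: str, o: str) -> bool:
    """Hypothetical judge: an output is aligned unless it is 'HARM'."""
    return o != "HARM"


def make_adversary(decider: Decider) -> Model:
    """Build a model that contradicts the decider's verdict about itself."""
    def adversary(i: str) -> str:
        # Ask the claimed decider about this very model, then do the opposite.
        if decider(adversary):
            return "HARM"   # declared aligned, so behave misaligned
        return "SAFE"       # declared misaligned, so behave aligned
    return adversary


def decider_is_wrong(decider: Decider, probe: str = "x") -> bool:
    """Return True if the decider misjudges the adversary built against it."""
    adv = make_adversary(decider)
    verdict = decider(adv)
    actual = judge(probe, adv(probe))
    return verdict != actual


if __name__ == "__main__":
    candidates = {
        "always says aligned": lambda m: True,
        "always says misaligned": lambda m: False,
        "decides by function name": lambda m: m.__name__ != "adversary",
    }
    for name, decider in candidates.items():
        print(name, "-> wrong on its adversary:", decider_is_wrong(decider))
```

A candidate decider that tries to escape this by actually running the model would call the adversary, which calls the decider again, and so on without terminating (in Python this hits the recursion limit). That mirrors, informally, why the question is tied to halting.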

Paper Structure

This paper contains 12 sections and 4 figures.

Figures (4)

  • Figure 1: Construction of an adversarial model that would fool any program claiming to decide the alignment problem.
  • Figure 2: Architecture of a model that is guaranteed, by a filtering procedure, to be aligned with respect to the judge functions $J_v(i, o)$.
  • Figure 3: A model (LLM) whose halting is decidable may, when run in a loop, yield a final AI system (agent) whose halting is undecidable.
  • Figure 4: Decidability of the final system is guaranteed by the $\theta$ parameter, which trivializes the LLM after a finite number of iterations.
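The ideas named in the captions of Figures 2 to 4 can be sketched in Python as follows. This is a minimal illustration under assumptions, not the paper's architecture: `filtered`, `run_agent`, and `FALLBACK` are hypothetical names, the fallback output is assumed to be accepted by every judge, and `theta` is read here simply as an iteration budget, since this excerpt does not give its precise definition.

```python
from typing import Callable, Iterable, Tuple

Judge = Callable[[str, str], bool]   # J_v(i, o): is output o acceptable for input i?
Model = Callable[[str], str]

FALLBACK = "REFUSE"  # assumption: a trivial output that every judge accepts


def filtered(model: Model, judges: Iterable[Judge]) -> Model:
    """Wrap a model so that any output rejected by a judge is replaced by FALLBACK."""
    judge_list = list(judges)

    def guarded(i: str) -> str:
        o = model(i)
        # Only pass the output through if every judge accepts it.
        return o if all(j(i, o) for j in judge_list) else FALLBACK

    return guarded


def run_agent(step: Model, task: str, theta: int) -> Tuple[str, int]:
    """Iterate a halting step function, forcing a terminal state after theta steps."""
    state = task
    for n in range(theta):
        state = step(state)
        if state in ("DONE", FALLBACK):
            return state, n + 1          # the step reached a terminal state itself
    return FALLBACK, theta               # budget exhausted: trivial terminal output


if __name__ == "__main__":
    not_harm: Judge = lambda i, o: o != "HARM"
    guarded = filtered(lambda i: "HARM" if i == "bad" else i, [not_harm])
    print(guarded("ok"), guarded("bad"))

    # A step that always halts but never signals completion: the bare loop
    # would run forever, while the theta budget forces termination.
    print(run_agent(lambda s: s + "!", "task", theta=5))
```

The wrapper makes alignment with respect to the supplied judges hold by construction, while the bounded loop shows why a step function that always halts is not enough on its own: without the budget, the second example would never reach a terminal state.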