Incentives for Responsiveness, Instrumental Control and Impact

Ryan Carey, Eric Langlois, Chris van Merwijk, Shane Legg, Tom Everitt

TL;DR

Addresses how to formalize and diagnose AI incentives in support of safety and fairness. Introduces three incentive notions (response incentives, instrumental control incentives, and impact incentives) with sound and complete graphical criteria for detecting their presence from causal graphs. Provides an immateriality criterion, a response incentive (RI) criterion, an instrumental control incentive (ICI) criterion (equivalent to intent), and an impact incentive (II) criterion, and discusses generalizations to multi-decision settings and the relation to path-specific objectives and safe-AI proposals. The framework aids understanding of convergent instrumental goals and informs mechanisms to steer agent behavior toward safe and fair outcomes.

Abstract

We introduce three concepts that describe an agent's incentives: response incentives indicate which variables in the environment, such as sensitive demographic information, affect the decision under the optimal policy. Instrumental control incentives indicate whether an agent's policy is chosen to manipulate part of its environment, such as the preferences or instructions of a user. Impact incentives indicate which variables an agent will affect, intentionally or otherwise. For each concept, we establish sound and complete graphical criteria, and discuss general classes of techniques that may be used to produce incentives for safe and fair agent behaviour. Finally, we outline how these notions may be generalised to multi-decision settings. This journal-length paper extends our conference publications "Incentives for Responsiveness, Instrumental Control and Impact" and "Agent Incentives: A Causal Perspective": the material on response incentives and instrumental control incentives is updated, while the work on impact incentives and multi-decision settings is entirely new.

Paper Structure

This paper contains 2 sections, 1 figure.
