Machine Theory of Mind and the Structure of Human Values

Paul de Font-Reaulx

TL;DR

The paper argues that human values are not reducible to simple utilities but have a rational, generative structure that connects values through instrumental relations. By extending Bayesian Theory of Mind to perform value-to-value inference, it shows how an AI could predict unobserved values from known ones, addressing the value generalization problem. It critiques standard inverse reinforcement learning for conflating reward with value and proposes a framework that integrates causal Bayesian networks and model-based RL to capture these instrumental relations. This work lays a foundation for scalable, safe AI value learning, while acknowledging unresolved ethical and philosophical questions.

Abstract

Value learning is a crucial aspect of safe and ethical AI. This is primarily pursued by methods inferring human values from behaviour. However, humans care about much more than we are able to demonstrate through our actions. Consequently, an AI must predict the rest of our seemingly complex values from a limited sample. I call this the value generalization problem. In this paper, I argue that human values have a generative rational structure and that this allows us to solve the value generalization problem. In particular, we can use Bayesian Theory of Mind models to infer human values not only from behaviour, but also from other values. This has been obscured by the widespread use of simple utility functions to represent human values. I conclude that developing generative value-to-value inference is a crucial component of achieving a scalable machine theory of mind.

Paper Structure

This paper contains 5 sections, 2 equations, and 2 figures.

Figures (2)

  • Figure 1: An illustration of Miriam's action and values, and their instrumental relations.
  • Figure 2: A world model with expected causal impact between $o$ and $x_1, \ldots, x_n$.