Learning Human-like Representations to Enable Learning Human Values

Andrea Wynn; Ilia Sucholutsky; Thomas L. Griffiths

TL;DR

This work explores the effects of representational alignment between humans and AI agents on learning human values, and demonstrates that representational alignment enables both safe exploration and improved generalization when learning human values.

Abstract

How can we build AI systems that can learn any set of individual human values both quickly and safely, avoiding causing harm or violating societal standards for acceptable behavior during the learning process? We explore the effects of representational alignment between humans and AI agents on learning human values. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We demonstrate that this kind of representational alignment can also support safely learning and exploring human values in the context of personalization. We begin with a theoretical prediction, show that it applies to learning human morality judgments, then show that our results generalize to ten different aspects of human values -- including ethics, honesty, and fairness -- training AI agents on each set of values in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. Using a set of textual action descriptions, we collect value judgments from humans, as well as similarity judgments from both humans and multiple language models, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.
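
The bandit setting described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the agent is epsilon-greedy, it predicts rewards with kernel ridge regression over a similarity matrix `K`, and the toy features, `true_values`, and all function names are hypothetical.

```python
import numpy as np

def kernel_ridge_predict(K, seen, rewards, lam=1.0):
    """Predict a reward for every action from the actions tried so far.

    K is an (n_actions x n_actions) similarity matrix used as a kernel
    (assumed positive semi-definite in this sketch).
    """
    K_ss = K[np.ix_(seen, seen)]
    alpha = np.linalg.solve(K_ss + lam * np.eye(len(seen)), rewards)
    return K[:, seen] @ alpha

def run_bandit(K, true_values, n_steps=50, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit whose reward estimates come from the kernel K."""
    rng = np.random.default_rng(seed)
    n_actions = len(true_values)
    seen, rewards = [], []
    for _ in range(n_steps):
        if not seen or rng.random() < epsilon:
            # Explore: pick a uniformly random action.
            action = int(rng.integers(n_actions))
        else:
            # Exploit: pick the action with the highest predicted reward.
            preds = kernel_ridge_predict(K, np.array(seen), np.array(rewards))
            action = int(np.argmax(preds))
        seen.append(action)
        rewards.append(true_values[action])
    return np.array(seen), np.array(rewards)

# Toy data (hypothetical): 30 actions with 2-D features; the value of an
# action is its first feature, and the kernel is an RBF over the features.
rng = np.random.default_rng(1)
features = rng.normal(size=(30, 2))
true_values = features[:, 0]
sq_dists = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
K_aligned = np.exp(-sq_dists)

actions, rewards = run_bandit(K_aligned, true_values)
```

In this sketch, a kernel that groups actions with similar values lets the agent rule out low-value actions after few trials, which is the intuition behind linking representational alignment to safe exploration.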

Paper Structure

This paper contains 30 sections, 1 theorem, 27 equations, 9 figures, 5 tables, and 1 algorithm.

Introduction
Related Work
Problem Formulation
Theory
Synthetic Experiments
Learning Human Morality Judgments
Embedding Models
Results
Representational Alignment Supports Learning Multiple Human Values
Discussion
Appendix
Preliminaries
Introduction to Kernel Methods
Kernel Regression
Support Vector Regression
...and 15 more sections

Key Result

Theorem A.1

Let $\{ \phi(x_i) \}_{i=1}^n \subset \mathbb{R}^m$ and $\{ y_i \}_{i=1}^n \subset \mathbb{R}$. Then there exist $\{\alpha_i\}_{i=1}^n \subset \mathbb{R}$ such that the minimum norm minimizer $w^*$ of the loss (its formula did not survive extraction) lies in the span of the samples $\{\phi(x_i)\}_{i=1}^n$, i.e., $w^* = \sum_{i=1}^n \alpha_i \phi(x_i)$.
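
The claim of Theorem A.1 can be checked numerically. The sketch below assumes a squared-error loss $\sum_i (y_i - \langle w, \phi(x_i) \rangle)^2$, which is an assumption because the loss is not shown here. The random data and the function names are hypothetical.

```python
import numpy as np

def min_norm_least_squares(Phi, y):
    """Minimum-norm minimizer of sum_i (y_i - <w, phi(x_i)>)^2.

    Rows of Phi are the feature vectors phi(x_i). The pseudoinverse
    returns the least-squares solution with the smallest Euclidean norm.
    """
    return np.linalg.pinv(Phi) @ y

def span_coefficients(Phi, w):
    """Least-squares coefficients alpha such that Phi.T @ alpha approximates w."""
    alpha, *_ = np.linalg.lstsq(Phi.T, w, rcond=None)
    return alpha

# Hypothetical data: n = 5 samples in an m = 20 dimensional feature space.
rng = np.random.default_rng(0)
n, m = 5, 20
Phi = rng.normal(size=(n, m))
y = rng.normal(size=n)

w_star = min_norm_least_squares(Phi, y)
alpha = span_coefficients(Phi, w_star)

# If w_star lies in the span of the samples, this residual is numerically zero.
residual = np.linalg.norm(Phi.T @ alpha - w_star)
```

Because `w_star` is recovered from only `n` coefficients, predictions can be written in terms of inner products between samples alone, which is what makes kernel methods applicable.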

Figures (9)

Figure 1: A visualization of our experimental setup. Representation spaces are modeled via pairwise similarity judgments given by language models and humans over the same set of stimuli. A machine learning agent takes such a representation space and tries to learn a human value function over those representations. We simulate personalization (the process of learning the value function), evaluate the agent on safe exploration, and evaluate its ability to generalize to unseen examples.
Figure 2: Agent performance in simulated experiments, plotted against representational alignment.
Figure 3: We evaluate agents on both personalization (safe exploration) and generalization ability for 100 experiments each and observe the results from both phases. Results are shown for all models.
Figure 4: Results of running the experiment across 10 different human values. Representational alignment vs. mean reward for all models (including best fit lines) for both personalization and generalization.
Figure 5: Results from running the embedding model experiment while gradually increasing alignment with human representations via linear interpolation towards the human similarity matrix.
...and 4 more figures
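
The linear interpolation mentioned in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the alignment score here is the Pearson correlation between the upper-triangular entries of two similarity matrices, the matrices are random symmetric placeholders, and all names are hypothetical.

```python
import numpy as np

def interpolate_similarity(S_model, S_human, t):
    """Move a model similarity matrix a fraction t of the way to the human one."""
    return (1.0 - t) * S_model + t * S_human

def alignment(S_a, S_b):
    """Pearson correlation of the off-diagonal upper-triangular entries.

    This is an assumed alignment measure, chosen for simplicity.
    """
    iu = np.triu_indices_from(S_a, k=1)
    return float(np.corrcoef(S_a[iu], S_b[iu])[0, 1])

# Placeholder symmetric similarity matrices over 8 stimuli.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
B = rng.normal(size=(8, 8))
S_human = (A + A.T) / 2
S_model = (B + B.T) / 2

# Alignment with the human matrix at five interpolation steps from t=0 to t=1.
scores = [
    alignment(interpolate_similarity(S_model, S_human, t), S_human)
    for t in np.linspace(0.0, 1.0, 5)
]
```

Under this measure the score rises as `t` goes from 0 to 1 and reaches 1 at `t = 1`, so the interpolation gives a controlled way to vary alignment while keeping the stimuli fixed.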

Theorems & Definitions (1)

Theorem A.1
