Value Internalization: Learning and Generalizing from Social Reward

Frieda Rong; Max Kleiman-Weiner

TL;DR

This work asks how culturally learned values persist once social supervision ends. It formalizes value internalization with an internalized social reward (ISR) model inside an augmented MDP, MDP-SR. During a socialization phase, a caregiver provides social rewards, and the ISR is trained to reproduce those rewards internally so that the agent can keep learning autonomously when the caregiver is absent; the agent's utility is $U = R_e + P R_s + (1-P) R_i$. Empirical results show that ISR-equipped agents avoid unlearning, generalize to out-of-distribution tasks, and internalize prosocial behavior in a multi-agent environment, while incomplete internalization can lead to reward hacking. The findings suggest mechanisms by which humans might internalize social values and offer a computational path toward aligning AI with human values by reducing reliance on external feedback for continued learning.
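The utility above can be read as a switch between social and internal reward. The following is a minimal sketch, assuming that $P$ is a binary indicator of caregiver presence and that $R_e$, $R_s$, and $R_i$ are the extrinsic, social, and internal rewards; the function and argument names are hypothetical and are not taken from the paper's source code.

```python
def utility(r_e: float, r_s: float, r_i: float, caregiver_present: bool) -> float:
    """Compute U = R_e + P * R_s + (1 - P) * R_i.

    Assumption: P is 1 when the caregiver is present and 0 otherwise,
    so exactly one of the social or internal terms contributes.
    """
    p = 1.0 if caregiver_present else 0.0
    return r_e + p * r_s + (1.0 - p) * r_i


# Caregiver present: the social reward is used, the internal reward is ignored.
u_present = utility(r_e=0.0, r_s=1.0, r_i=0.25, caregiver_present=True)

# Caregiver absent: the ISR's internal reward stands in for the social reward.
u_absent = utility(r_e=0.0, r_s=1.0, r_i=0.25, caregiver_present=False)
```

Under this reading, an agent whose ISR output closely matches the caregiver's reward receives nearly the same utility whether or not the caregiver is present.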

Abstract

Social rewards shape human behavior. During development, a caregiver guides a learner's behavior towards culturally aligned goals and values. How do these behaviors persist and generalize when the caregiver is no longer present, and the learner must continue autonomously? Here, we propose a model of value internalization where social feedback trains an internal social reward (ISR) model that generates internal rewards when social rewards are unavailable. Through empirical simulations, we show that an ISR model prevents agents from unlearning socialized behaviors and enables generalization in out-of-distribution tasks. We characterize the implications of incomplete internalization, akin to "reward hacking" on the ISR. Additionally, we show that our model internalizes prosocial behavior in a multi-agent environment. Our work provides a foundation for understanding how humans acquire and generalize values and offers insights for aligning AI with human values.
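The internalization mechanism described above can be illustrated with a toy regression. This is a sketch under stated assumptions, not the paper's implementation: the linear model, the two-element feature vector, the learning rate, and the `caregiver_reward` rule are all hypothetical choices made for illustration.

```python
import random


class LinearISR:
    """Toy internalized social reward model: a linear regressor (assumed form)."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        """Internal reward for feature vector x."""
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, social_reward):
        """One SGD step on squared error against the observed social reward."""
        err = self.predict(x) - social_reward
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
        return err ** 2


def caregiver_reward(x):
    # Hypothetical caregiver: rewards the learner when feature 0 ("at goal") is set.
    return 1.0 if x[0] == 1.0 else 0.0


# Socialization phase: the caregiver is present and labels each observation.
rng = random.Random(0)
isr = LinearISR(n_features=2)
for _ in range(2000):
    x = [float(rng.randint(0, 1)), float(rng.randint(0, 1))]
    isr.update(x, caregiver_reward(x))

# Caregiver absent: the ISR supplies the reward signal on its own.
internal_at_goal = isr.predict([1.0, 0.0])
internal_elsewhere = isr.predict([0.0, 1.0])
```

After the socialization loop, `internal_at_goal` should be close to 1 and `internal_elsewhere` close to 0, so the learner can keep receiving a caregiver-like signal once social rewards stop.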

Paper Structure (11 sections, 1 equation, 7 figures)

This paper contains 11 sections, 1 equation, 7 figures.

Introduction
Related Computational Work
MDPs With Social Rewards
Modeling Value Internalization
Results
Training the ISR model
Continual Learning and Generalization
Internalization Failure: Reward Hacking
Internalization of Prosocial Values
Discussion
Source Code

Figures (7)

Figure 1: The challenge of learning from social rewards. (left) Three example grids from our environment. The goal square is shown in green, and the agent is the red triangle. Three obstacles shown in grey are randomly arranged in each grid. (right) Learning with (blue) and without (green) social rewards. A baseline reinforcement learning agent learns to navigate to the green square when the caregiver is present. The goal-directed behavior is unlearned once the caregiver leaves (dotted vertical line at 6K episodes for the green trace). Traces averaged over ten seeds and smoothed. Bands show the min and max.
Figure 2: Agent architectures. (left) Standard view of reinforcement learning with extrinsic reward from the environment. (right) Learning from social rewards. Dotted lines indicate that the caregiver and the social rewards they give are not always present. When present, social rewards affect the policy as well as train an internalized social reward model (ISR) that provides internal rewards when the caregiver is absent.
Figure 3: Training the internalized social reward (ISR) model. (left) Example training curve for the ISR model trained on social rewards. The model quickly converges with no measurable gap between train and test performance. (right) ISR test loss continually decreases when trained with more social rewards. Results averaged over ten seeds and smoothed. Error bars are the standard error.
Figure 4: The ISR model prevents unlearning and enables out-of-distribution (OOD) generalization. (left) Agents first learn with social rewards from the caregiver (blue). After 6K episodes, the caregiver is removed (vertical dotted line). Without the ISR model, the agent quickly unlearns the behavior (green). The ISR model prevents unlearning with no measurable loss in performance (red). Results averaged over ten seeds and smoothed. Bands show the min and max. (right) Comparing OOD generalization where models were trained with one block and must generalize to five. The ISR performance was significantly greater than the frozen baseline ($p<0.05$, t-test) and not significantly different from the oracle ($p=0.32$, t-test). See text for model descriptions. Results averaged across ten seeds and smoothed. Error bars show the standard error of the mean.
Figure 5: Out-of-distribution (OOD) generalization on custom environments. Agents were trained with only a single block and evaluated on their ability to generalize OOD to the above five-block tasks. The starting locations of the goal and agent were sampled randomly. The ISR significantly outperforms the frozen model ($p < 0.001$, $t=-3.66$, linear mixed-effect model with environment as a fixed effect) but did not significantly differ from the oracle ($p=0.27$, $t=1.12$, linear mixed-effect model with environment as a fixed effect). Results are averaged over ten seeds and smoothed. Error bars are standard errors.
...and 2 more figures
