Value Internalization: Learning and Generalizing from Social Reward
Frieda Rong, Max Kleiman-Weiner
TL;DR
This work asks how culturally learned values persist once social supervision ends. It formalizes value internalization with an internal social reward (ISR) model inside an augmented MDP, MDP-SR. During a socialization phase, a caregiver provides social rewards and the ISR is trained to reproduce them, so the agent can keep learning autonomously when the caregiver is absent. The agent’s utility is $U = R_e + P R_s + (1-P) R_i$. In simulations, ISR-equipped agents avoid unlearning socialized behaviors, generalize to out-of-distribution tasks, and internalize prosocial behavior in a multi-agent environment, while incomplete internalization can lead to reward hacking on the ISR. The results suggest mechanisms by which humans might internalize social values, and a computational path toward aligning AI with human values that relies less on continued external feedback.
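To make the utility above concrete, here is a minimal sketch of how the gating between social and internal reward might look in code. It rests on assumptions not stated in this summary: that $R_e$ is an environment reward, $R_s$ the caregiver's social reward, $R_i$ the ISR's output, and $P$ a 0/1 indicator of caregiver presence. The names `TabularISR`, `utility`, and `step_utility` are hypothetical, and the tabular running-average update is a stand-in for illustration, not the authors' ISR architecture.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TabularISR:
    """Hypothetical tabular ISR: one predicted social reward per (state, action)."""
    lr: float = 0.5
    table: dict = field(default_factory=lambda: defaultdict(float))

    def predict(self, state, action):
        # Internal reward R_i: the model's current estimate of the social reward.
        return self.table[(state, action)]

    def update(self, state, action, social_reward):
        # Move the estimate toward the observed caregiver reward (regression step).
        key = (state, action)
        self.table[key] += self.lr * (social_reward - self.table[key])


def utility(r_env, r_social, r_internal, caregiver_present):
    # U = R_e + P * R_s + (1 - P) * R_i, with P assumed to be a 0/1 indicator.
    p = 1.0 if caregiver_present else 0.0
    return r_env + p * r_social + (1.0 - p) * r_internal


def step_utility(isr, state, action, r_env, r_social=None):
    # Assumption: r_social is None exactly when the caregiver is absent.
    if r_social is not None:
        # Socialization phase: learn from the caregiver and use R_s directly.
        isr.update(state, action, r_social)
        return utility(r_env, r_social, 0.0, True)
    # Autonomous phase: the ISR's prediction stands in for the missing R_s.
    return utility(r_env, 0.0, isr.predict(state, action), False)


if __name__ == "__main__":
    isr = TabularISR(lr=1.0)
    print(step_utility(isr, "s0", "share", r_env=0.0, r_social=1.0))  # caregiver present
    print(step_utility(isr, "s0", "share", r_env=0.0))                # caregiver absent
```

In this sketch, a state-action pair that was never socially rewarded yields zero internal reward once the caregiver leaves. This is one simple way to see why incomplete internalization leaves gaps that an agent could exploit.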
Abstract
Social rewards shape human behavior. During development, a caregiver guides a learner's behavior towards culturally aligned goals and values. How do these behaviors persist and generalize when the caregiver is no longer present, and the learner must continue autonomously? Here, we propose a model of value internalization where social feedback trains an internal social reward (ISR) model that generates internal rewards when social rewards are unavailable. Through empirical simulations, we show that an ISR model prevents agents from unlearning socialized behaviors and enables generalization in out-of-distribution tasks. We characterize the implications of incomplete internalization, akin to "reward hacking" on the ISR. Additionally, we show that our model internalizes prosocial behavior in a multi-agent environment. Our work provides a foundation for understanding how humans acquire and generalize values and offers insights for aligning AI with human values.
