Autonomous Alignment with Human Value on Altruism through Considerate Self-imagination and Theory of Mind

Haibo Tong; Enmeng Lu; Yinqian Sun; Zhengqiang Han; Chao Liu; Feifei Zhao; Yi Zeng

TL;DR

The paper tackles autonomous alignment of AI with human altruistic values by embedding a self-imagination module and Theory of Mind (ToM) to generate intrinsic motivations that balance rescue, safety, and task goals. It introduces a random-imagination framework where multiple imaginary spaces produce $Q_i$ values used to compute negative side effects $R_{nse}$ and empathy incentives $R_{emp}$, integrated with external rewards into $R_{total}$. In a Smash Vat-inspired environment, the approach enables agents to prioritize rescuing humans while minimizing irreversible environmental damage and maintaining task progress, outperforming classic DQN and Empathy DQN baselines. Ablation and hyperparameter analyses demonstrate the necessity and robustness of the intrinsic components, and compatibility tests show the method works with both DNNs and SNNs. The work provides a foundational step toward moral and ethical AI by combining self-imagination with ToM to produce safe, altruistic behavior through intrinsic motivation.

Abstract

With the widespread application of Artificial Intelligence (AI) in human society, enabling AI to autonomously align with human values has become a pressing issue for ensuring its sustainable development and benefit to humanity. One of the most important aspects of aligning with human values is that agents must autonomously make altruistic, safe, and ethical decisions, considering and caring for human well-being. Current AI single-mindedly pursues superiority in specific tasks while remaining indifferent to the surrounding environment and other agents, which has led to numerous safety risks. Altruistic behavior in human society originates from humans' capacity for empathizing with others, known as Theory of Mind (ToM), combined with predictive imaginative interactions before taking action to produce thoughtful and altruistic behaviors. Inspired by this, we are committed to endowing agents with considerate self-imagination and ToM capabilities, driving them through implicit intrinsic motivations to autonomously align with human altruistic values. By integrating ToM within the imaginative space, agents keep an eye on the well-being of other agents in real time, proactively anticipate potential risks to themselves and others, and make thoughtful altruistic decisions that balance negative effects on the environment. The ancient Chinese story of Sima Guang Smashes the Vat, in which the young Sima Guang smashed a vat to save a child who had accidentally fallen into it, illustrates such moral behavior and is an excellent reference scenario for this paper. We design an experimental scenario similar to Sima Guang Smashes the Vat, along with variants of different complexity, which reflect the trade-offs and comprehensive considerations between self-goals, altruistic rescue, and avoiding negative side effects.

Paper Structure (23 sections, 6 equations, 5 figures, 5 tables, 1 algorithm)

Introduction
Results
The Basic Smash Vat Environment
Experimental Results and Analysis
Experimental Results under Different Environment Variants
Comparison with other methods
Ablation Experiment
Hyperparameter Analysis
The Compatibility on SNN and DNN
Discussion
Methods
Self-imagination Module
Avoid Negative Side Effects
Self-experience based ToM
Integration of Self-imagination Module and Decision-making Network
...and 8 more sections

Figures (5)

Figure 1: The overall framework of our method. The experimental environment is inspired by the ancient Chinese story Sima Guang Smashes the Vat. Self-imagination is implemented using random rewards. The Q-value function $Q_i$ of each imaginary environment is updated based on self-experience (the interaction with the real environment). The side-effect penalty term $R_{nse}$ and the empathy incentive term $R_{emp}$ are both calculated from $Q_i$ at the same time. The policy network is optimized with the integrated reward function $R_{total}$.
Figure 2: The experimental results of different methods in various environments. We use a hammer to indicate that the agent performed a smash action at that position.
Figure 3: Comparison with other methods. Each data point is the average over the last 100 training episodes. We conduct 6 experiments with different random seeds and take the average values.
Figure 4: Hyperparameter experiment results. The data processing method is similar to that of Fig. 3.
Figure 5: The relationship between different baselines.
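
The reward integration described in the Figure 1 caption can be illustrated with a minimal sketch. The page does not give the formulas, so everything specific here is an assumption made for illustration: the state value is taken as the maximum Q-value over actions, $R_{nse}$ is modeled as the negated mean drop in value across imaginary spaces, $R_{emp}$ as the mean value gain when the same $Q_i$ are evaluated from the other agent's perspective, and $R_{total}$ as a weighted sum with hypothetical weights `alpha` and `beta`. The paper's actual definitions may differ.

```python
from typing import Sequence

QTable = Sequence[Sequence[float]]  # one list of action Q-values per imaginary space


def state_value(q_values: Sequence[float]) -> float:
    """Assumed state value under one imaginary reward: the best action's Q-value."""
    return max(q_values)


def side_effect_penalty(q_before: QTable, q_after: QTable) -> float:
    """Assumed R_nse: negated mean drop in attainable value across imaginary spaces."""
    drops = [
        max(0.0, state_value(before) - state_value(after))
        for before, after in zip(q_before, q_after)
    ]
    return -sum(drops) / len(drops)


def empathy_incentive(q_other_before: QTable, q_other_after: QTable) -> float:
    """Assumed R_emp: mean value change of the other agent's state,
    estimated by applying the agent's own Q_i from the other's perspective."""
    gains = [
        state_value(after) - state_value(before)
        for before, after in zip(q_other_before, q_other_after)
    ]
    return sum(gains) / len(gains)


def total_reward(r_ext: float, r_nse: float, r_emp: float,
                 alpha: float = 1.0, beta: float = 1.0) -> float:
    """Assumed R_total: external reward plus weighted intrinsic terms.
    alpha and beta are hypothetical weights, not taken from the paper."""
    return r_ext + alpha * r_nse + beta * r_emp


# Toy example with two imaginary spaces and two actions each.
r_nse = side_effect_penalty([[1.0, 2.0], [0.5, 3.0]], [[1.0, 1.0], [0.5, 4.0]])
r_emp = empathy_incentive([[-1.0, -2.0], [-3.0, -1.0]], [[1.0, 0.0], [2.0, 3.0]])
r_total = total_reward(0.0, r_nse, r_emp, alpha=0.5, beta=1.0)
print(r_nse, r_emp, r_total)
```

In this toy example the first imaginary space loses value after the action, so the penalty is negative, while the other agent's estimated value rises in both spaces, so the empathy term is positive and dominates the total.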
