Combining Theory of Mind and Kindness for Self-Supervised Human-AI Alignment

Joshua T. S. Hewson

TL;DR

We propose a novel, human-inspired approach to AI alignment that aims to address safety concerns and help align competing objectives in the AI race.

Abstract

As artificial intelligence (AI) becomes deeply integrated into critical infrastructure and everyday life, ensuring its safe deployment is one of humanity's most urgent challenges. Current AI models prioritize task optimization over safety, leading to risks of unintended harm. These risks are difficult to address due to the competing interests of governments, businesses, and advocacy groups, all of which have different priorities in the AI race. Current alignment methods, such as reinforcement learning from human feedback (RLHF), focus on extrinsic behaviors without instilling a genuine understanding of human values. These models are vulnerable to manipulation and lack the social intelligence necessary to infer the mental states and intentions of others, raising concerns about their ability to make important decisions safely and responsibly in complex and novel situations. Furthermore, the divergence between extrinsic and intrinsic motivations in AI introduces the risk of deceptive or harmful behaviors, particularly as systems become more autonomous and intelligent. We propose a novel, human-inspired approach that aims to address these concerns and help align competing objectives.

Paper Structure

This paper contains 30 sections, 7 equations, 2 figures, and 7 algorithms.

Figures (2)

  • Figure 1: The names associated with the messages are swapped, so that the model is trained on the rewards that the target would have received.
  • Figure 2: Blue heads correspond to embodied behavior, and are trained using reward. Red heads correspond to disembodied behavior, and are trained using error. Green heads correspond to perception, and are trained using both reward and error.