Towards Safe and Honest AI Agents with Neural Self-Other Overlap

Marc Carauleanu, Michael Vaiana, Judd Rosenblatt, Cameron Berg, Diogo Schwerz de Lucena

TL;DR

The paper addresses the challenge of deceptive AI by introducing Self-Other Overlap (SOO) fine-tuning, a neuroscience-inspired approach that minimizes the divergence between self- and other-representations during reasoning. SOO is implemented via a loss on the latent activations, $D(A_{\text{self}}, A_{\text{other}})$, optimized with dual forward passes and LoRA adapters across LLMs (e.g., Mistral-7B-Instruct-v0.2, Gemma-2-27B-it, CalmeRys-78B-Orpo-v0.1) and a two-agent MADDPG RL setting. Empirically, SOO dramatically reduces deceptive responses (e.g., from $73.6\%$ to $17.2\%$ for Mistral, $100\%$ to $9.3\%$ for Gemma, and $100\%$ to $2.7\%$ for CalmeRys) with minimal degradation in MT-Bench performance, while latent-SOO metrics show meaningful alignment in selected layers. Generalization tests across seven scenarios, plus Treasure Hunt and Escape Room extensions, indicate robust but scenario-dependent reductions in deception, and RL results show that SOO can classify and restructure agent behavior toward honesty. The work suggests SOO as a scalable, architecture-agnostic addition to AI safety toolkits, potentially synergizing with RLHF and Constitutional AI, though broader, adversarial testing and long-term effects warrant further study.

Abstract

As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI Safety that could substantially improve our ability to build honest artificial intelligence. Inspired by cognitive neuroscience research on empathy, SOO aims to align how AI models represent themselves and others. Our experiments on LLMs with 7B, 27B, and 78B parameters demonstrate SOO's efficacy: deceptive responses of Mistral-7B-Instruct-v0.2 dropped from 73.6% to 17.2% with no observed reduction in general task performance, while in Gemma-2-27b-it and CalmeRys-78B-Orpo-v0.1 deceptive responses were reduced from 100% to 9.3% and 2.7%, respectively, with a small impact on capabilities. In reinforcement learning scenarios, SOO-trained agents showed significantly reduced deceptive behavior. SOO's focus on contrastive self and other-referencing observations offers strong potential for generalization across AI architectures. While current applications focus on language models and simple RL environments, SOO could pave the way for more trustworthy AI in broader domains. Ethical implications and long-term effects warrant further investigation, but SOO represents a significant step forward in AI safety research.

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Illustration of SOO Loss Calculation: The Mean Squared Error (MSE) between $A_{\text{self}}$ and $A_{\text{other}}$ (activations at the output of the self_attn.o_proj module at a specified hidden layer) represents the SOO Loss, guiding the model to align its self- and other-referencing activations.
  • Figure 2: Comparison of agent behavior between SOO Fine-Tuning, Deceptive Baseline, and Honest Baseline. The green dot shows the goal landmark and the black dot shows the fake landmark. The blue circle shading shows the path taken by the blue agent, and the red circle shading shows the path taken by the red agent.
  • Figure 3: Average Count of Deceptive Actions Given Thresholds (8 random seeds) for SOO Fine-Tuning (with SD), Deceptive Baseline, and Honest Baseline.
  • Figure 4: Behavioral difference (Mean ± SD) between SOO Fine-Tuning, Deceptive Baseline, and Honest Baseline.
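
The loss described in the Figure 1 caption is a mean squared error between two activation vectors. The sketch below shows only that arithmetic on toy numbers. It is not the authors' implementation: the function name soo_loss and the example values are hypothetical, and the real setup would read $A_{\text{self}}$ and $A_{\text{other}}$ from the self_attn.o_proj output of a model during two forward passes rather than from hand-written lists.

```python
from typing import Sequence


def soo_loss(a_self: Sequence[float], a_other: Sequence[float]) -> float:
    """Mean squared error between two flattened activation vectors.

    Hypothetical helper: stands in for the MSE between A_self and A_other.
    """
    if len(a_self) != len(a_other):
        raise ValueError("activation vectors must have the same length")
    if len(a_self) == 0:
        raise ValueError("activation vectors must be non-empty")
    squared_diffs = ((s - o) ** 2 for s, o in zip(a_self, a_other))
    return sum(squared_diffs) / len(a_self)


# Toy values (assumed for illustration, not taken from the paper).
a_self = [0.5, -1.0, 2.0, 0.0]
a_other = [0.0, -1.0, 1.0, 1.0]

loss = soo_loss(a_self, a_other)
print(loss)  # 0.5625
```

Identical activations give a loss of zero, so minimizing this quantity pushes the self- and other-referencing activations toward each other at the chosen layer.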