NeuroAI for AI Safety

Patrick Mineault, Niccolò Zanichelli, Joanne Zichen Peng, Anton Arkhipov, Eli Bingham, Julian Jara-Ettinger, Emily Mackevicius, Adam Marblestone, Marcelo Mattar, Andrew Payne, Sophia Sanborn, Karen Schroeder, Zenna Tavares, Andreas Tolias, Anthony Zador

TL;DR

This roadmap surveys how neuroscience can inform the design of safer AI, emphasizing a holistic NeuroAI strategy that spans multiple levels of analysis. It categorizes concrete avenues: reverse-engineering sensory systems (sensory digital twins), building embodied digital twins, developing biophysically detailed models, and crafting robust cognitive architectures, alongside fine-tuning on brain data, evolutionary curricula, and loss-function inference. The authors assess feasibility, propose evaluation metrics, and outline opportunities and bottlenecks (data, tooling, theory) to guide coordinated research and funding. By advocating a safety-first orientation alongside scalable data-driven methods, the paper argues for cross-disciplinary collaboration to reduce risks in current and future agentic AI systems. Overall, the work provides a structured, multi-pronged blueprint for leveraging brain-inspired principles to enhance AI safety, robustness, and alignment with human values.

Abstract

As AI systems become increasingly powerful, the need for safe AI has become more pressing. Humans are an attractive model for AI safety: as the only known agents capable of general intelligence, they perform robustly even under conditions that deviate significantly from prior experiences, explore the world safely, understand pragmatics, and can cooperate to meet their intrinsic goals. Intelligence, when coupled with cooperation and safety mechanisms, can drive sustained progress and well-being. These properties are a function of the architecture of the brain and the learning algorithms it implements. Neuroscience may thus hold important keys to technical AI safety that are currently underexplored and underutilized. In this roadmap, we highlight and critically evaluate several paths toward AI safety inspired by neuroscience: emulating the brain's representations, information processing, and architecture; building robust sensory and motor systems from imitating brain data and bodies; fine-tuning AI systems on brain data; advancing interpretability using neuroscience methods; and scaling up cognitively-inspired architectures. We make several concrete recommendations for how neuroscience can positively impact AI safety.

Paper Structure

This paper contains 121 sections, 20 equations, 22 figures.

Figures (22)

  • Figure 1: A framework for AI safety adapted to NeuroAI.
  • Figure 2: Process for evaluating proposals for how neuroscience can impact AI safety.
  • Figure 3: An example of an adversarial attack on an image recognition model (ResNet-50 from torchvision). An imperceptible change in the input image leads the model to confidently apply the label "microwave" to this image of a chihuahua. Inspired by Goodfellow2014-pc.
  • Figure 4: A digital twin of a sensory system seeks primarily to account for the relationship between input $i(t)$ and the state of the brain $s(t)$ as reflected by high-dimensional measurements $m(t)$, potentially taking into account. By contrast, an embodied digital twin (Section \ref{sec-embodied}) must account for the full relationship between input, brain state, motor output $o(t)$, as well as the world $w(t)$.
  • Figure 5: Typical layout of a visual digital twin trained on neural data. Images are processed by a common core, which gets read out by a series of weights to predict the firing rate of a set of neurons. This visual input-neuronal output mapping may be modulated by external drives, for example the position of the eye. Both the common core and the readout feature weights are learned. In the following, we distinguish between scaling laws for the core and the readout. From Fu2023-pe, under a CC-BY-NC-ND 4.0 license.
  • ...and 17 more figures
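
The adversarial attack described in the Figure 3 caption can be illustrated on a much smaller scale. The sketch below is not the paper's ResNet-50 experiment: it swaps in a toy logistic-regression classifier and applies one step of the fast gradient sign method, which we assume is the kind of attack the caption alludes to. The weights, input, and `epsilon` are hypothetical values chosen only to show how many tiny per-dimension changes can add up to flip a prediction.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(g):
    """Return -1, 0, or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)


def predict(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fgsm(weights, bias, x, label, epsilon):
    """One fast-gradient-sign step that increases the cross-entropy loss.

    For logistic regression, d(loss)/d(x_i) = (p - label) * w_i,
    so no automatic differentiation is needed in this toy case.
    """
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy model and input (assumptions, not taken from the paper).
weights = [0.5, -0.5] * 50   # 100 input dimensions
bias = 0.0
x = [0.52, 0.48] * 50        # initially classified as positive

x_adv = fgsm(weights, bias, x, label=1, epsilon=0.05)

p_clean = predict(weights, bias, x)
p_adv = predict(weights, bias, x_adv)
max_change = max(abs(a - b) for a, b in zip(x_adv, x))
```

No input dimension moves by more than `epsilon`, yet the predicted class changes, because the perturbation is aligned with the loss gradient in every dimension at once.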
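
The core-plus-readout layout described in the Figure 5 caption can be sketched as a forward pass. This is a minimal illustration under stated assumptions, not the architecture from Fu2023-pe: the single linear-plus-ReLU core, the softplus output, the multiplicative form of the eye-position modulation, and all function names and sizes are hypothetical. In an actual digital twin the core and readout weights would be fit to recorded neural responses; here they are random.

```python
import math
import random


def core(image, core_weights):
    """Shared core: one linear layer plus ReLU, producing features used by all neurons."""
    return [max(0.0, sum(w * px for w, px in zip(row, image))) for row in core_weights]


def modulation_gain(eye_position, mod_weights):
    """Hypothetical multiplicative gain from an external drive such as eye position."""
    return math.exp(sum(w * e for w, e in zip(mod_weights, eye_position)))


def readout(features, readout_weights, gain):
    """Per-neuron readout: weighted sum of core features, scaled by the gain.

    The softplus keeps predicted firing rates non-negative.
    """
    rates = []
    for row in readout_weights:
        drive = gain * sum(w * f for w, f in zip(row, features))
        rates.append(math.log1p(math.exp(drive)))
    return rates


# Hypothetical sizes: 16 pixels, 8 core features, 5 recorded neurons.
rng = random.Random(0)
n_pixels, n_features, n_neurons = 16, 8, 5

core_weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_pixels)] for _ in range(n_features)]
readout_weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_features)] for _ in range(n_neurons)]
mod_weights = [0.1, -0.1]

image = [rng.random() for _ in range(n_pixels)]
eye_position = [0.2, 0.3]

features = core(image, core_weights)
gain = modulation_gain(eye_position, mod_weights)
rates = readout(features, readout_weights, gain)
```

Keeping the core shared and the readout per neuron is what makes it meaningful to discuss scaling for each part separately, as the caption does: the core grows with stimulus data, while the readout grows with the number of recorded neurons.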