Brain-Model Evaluations Need the NeuroAI Turing Test

Jenelle Feather, Meenakshi Khosla, N. Apurva Ratan Murty, Aran Nayebi

TL;DR

The paper argues that evaluating brain-inspired AI requires more than behavioral indistinguishability and proposes the NeuroAI Turing Test, a benchmark that enforces both behavioral alignment and representational convergence to brain data within inter-subject variability. It formalizes the test by defining a dataset $\mathcal{D}$, constructing inter-organism and model-organism distance sets, and requiring convergence in distribution between these sets under a chosen similarity metric $\mathcal{M}$ at significance level $\alpha$, while correcting for measurement noise. The authors review the current state of the field, outline alternate notions of brain-likeness, discuss trade-offs with interpretability and safety, address data and sampling limitations, and argue that the test is achievable through progressive milestones and frontier datasets. They conclude that a rigorous, flexible benchmark centered on both behavior and internal representations can unify AI and neuroscience, driving the development of truly brain-like AI with wide-ranging scientific and practical impact.
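The comparison of inter-organism and model-organism distance sets described above can be illustrated with a minimal sketch. This is not the paper's procedure: the function names, the correlation-distance metric standing in for $\mathcal{M}$, and the one-sided permutation test on mean distance (used here as a simple proxy for a distributional comparison at level $\alpha$) are all assumptions made for illustration, and no noise correction is applied.

```python
import numpy as np

def correlation_distance(a, b):
    """Hypothetical stand-in for the metric M: 1 minus Pearson correlation."""
    return 1.0 - np.corrcoef(a.ravel(), b.ravel())[0, 1]

def inter_organism_distances(organism_reps, metric):
    """Distances between every unordered pair of organisms."""
    n = len(organism_reps)
    return np.array([metric(organism_reps[i], organism_reps[j])
                     for i in range(n) for j in range(i + 1, n)])

def model_organism_distances(model_rep, organism_reps, metric):
    """Distances between the model and each organism."""
    return np.array([metric(model_rep, r) for r in organism_reps])

def permutation_pvalue(inter, model, n_perm=2000, seed=0):
    """One-sided p-value for 'model distances are larger on average'.

    Pairwise distances are not independent samples, so this is only
    an approximate test (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    observed = model.mean() - inter.mean()
    pooled = np.concatenate([inter, model])
    n_inter = len(inter)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[n_inter:].mean() - perm[:n_inter].mean()
        count += int(diff >= observed)
    return (count + 1) / (n_perm + 1)

def passes_sketch_test(model_rep, organism_reps,
                       metric=correlation_distance, alpha=0.05):
    """Pass if the model is not detectably farther from organisms
    than organisms are from each other, at level alpha."""
    inter = inter_organism_distances(organism_reps, metric)
    model = model_organism_distances(model_rep, organism_reps, metric)
    p = float(permutation_pvalue(inter, model))
    return bool(p >= alpha), p
```

In this sketch, each representation is assumed to be a NumPy array of responses to the same stimuli in $\mathcal{D}$, already mapped into a common space. A model passes when its distances to the organisms cannot be distinguished from the organism-to-organism distances at level $\alpha$.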

Abstract

What makes an artificial system a good model of intelligence? The classical test proposed by Alan Turing focuses on behavior, requiring that an artificial agent's behavior be indistinguishable from that of a human. While behavioral similarity provides a strong starting point, two systems with very different internal representations can produce the same outputs. Thus, in modeling biological intelligence, the field of NeuroAI often aims to go beyond behavioral similarity and achieve representational convergence between a model's activations and the measured activity of a biological system. This position paper argues that the standard definition of the Turing Test is incomplete for NeuroAI, and proposes a stronger framework called the "NeuroAI Turing Test", a benchmark that extends beyond behavior alone and additionally requires models to produce internal neural representations that are empirically indistinguishable from those of a brain up to measured individual variability, i.e., the difference between a computational model and the brain is no more than the difference between one brain and another brain. While the brain is not necessarily the ceiling of intelligence, it remains the only universally agreed-upon example, making it a natural reference point for evaluating computational models. By proposing this framework, we aim to shift the discourse from loosely defined notions of brain inspiration to a systematic and testable standard centered on both behavior and internal representations, providing a clear benchmark for neuroscientific modeling and AI development.

Paper Structure

This paper contains 27 sections, 17 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The complete NeuroAI Turing Test reflects both the behavioral similarity of an artificial system to a biological one and the similarity of their internal representations.
  • Figure 2: Possible outcomes of a NeuroAI Turing Test (brain-brain similarity). Each bar is for a different model. The model similarity measure and the brain-brain similarity (NeuroAI Turing Test) are both corrected by the square root of the product of the internal and mapping consistencies, constituting the "Statistical Noise Ceiling" (see the inter-animal methods appendix for details). Different interpretations arise from the relationship between the model similarity and the distribution of the brain-brain similarity. Researchers should report both values to ensure that a benchmark is not saturated according to brain-brain similarity. Although this figure focuses on the alignment of internal representations, similar comparisons should be reported for behavioral tests.
  • Figure 3: The NeuroAI Turing Test on the classic HvM dataset (Majaj et al., 2015) with different metrics. Neural predictivity is shown for a ResNet-18 inspired feedforward network from Nayebi et al. (2022). The linear mapping was performed on multiple timepoints with different numbers of PLS components, and each dataset and measure has a separate value for the NeuroAI Turing Test. These data suggest that, at least for HvM, this benchmark is reasonably saturated and other benchmarks should be chosen for primate vision.
  • Figure 4: Examples of previous studies using behavioral and representational similarity tests. Artificial models have reached biological behavior and representational similarity in some datasets but not others.
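
The noise correction mentioned in the Figure 2 caption, dividing a similarity score by the square root of the product of the internal and mapping consistencies, can be sketched as follows. The function and parameter names are hypothetical, and how the two consistencies are estimated (for example, from split-half reliabilities) is left to the paper's appendix and not reproduced here.

```python
import math

def noise_corrected_similarity(raw_similarity,
                               internal_consistency,
                               mapping_consistency):
    """Divide a raw similarity score by sqrt(internal * mapping consistency).

    Both consistencies are assumed to be positive reliability estimates;
    their estimation is outside the scope of this sketch.
    """
    product = internal_consistency * mapping_consistency
    if product <= 0:
        raise ValueError("consistencies must have a positive product")
    return raw_similarity / math.sqrt(product)
```

Under this sketch, the same correction would be applied to both the model-brain score and the brain-brain score before they are compared, so that a model is judged against what the data's reliability allows rather than against a perfect score.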

Theorems & Definitions (1)

  • Definition 3.1: Convergence in Distribution.