Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience

Zhonghao He; Jascha Achterberg; Katie Collins; Kevin Nejad; Danyal Akarca; Yinzhu Yang; Wes Gurnee; Ilia Sucholutsky; Yuhan Tang; Rebeca Ianov; George Ogden; Chole Li; Kai Sandbrink; Stephen Casper; Anna Ivanova; Grace W. Lindsay

TL;DR

The paper tackles the challenge of interpreting very large neural networks by importing Marr's three levels of analysis (computational, algorithmic/representational, and implementational) from neuroscience into AI interpretability. It argues for a multilevel, cross-disciplinary approach that links behavior, representations, and neural substrates, and it surveys tools from neuroethology, psychophysics, Bayesian cognition, decoding and encoding models, and neural geometry according to the level at which each applies. Through a case study on deception, it demonstrates how level-specific questions can guide experiments, analyses, and interventions for understanding and controlling AI behavior. The work advocates a principled, integrated research program that draws on cross-field insights to make intelligent systems easier to understand, predict, and edit, with practical implications for safety, reliability, and trust.

Abstract

As deep learning systems are scaled up to many billions of parameters, relating their internal structure to their external behavior becomes very challenging. Although daunting, this problem is not new: neuroscientists and cognitive scientists have accumulated decades of experience analyzing a particularly complex system, the brain. In this work, we argue that interpreting both biological and artificial neural systems requires analyzing them at multiple levels, with different analytic tools for each level. We first lay out a grand challenge shared by scientists who study the brain and those who study artificial neural networks: understanding how distributed neural mechanisms give rise to complex cognition and behavior. We then present a series of analytical tools that can be applied to biological and artificial neural systems, organizing those tools according to Marr's three levels of analysis: computation/behavior, algorithm/representation, and implementation. Overall, the multilevel interpretability framework provides a principled way to tackle the complexity of neural systems; links structure, computation, and behavior; clarifies assumptions and research priorities at each level; and paves the way toward a unified effort to understand intelligent systems, whether biological or artificial.
