Towards falsifiable interpretability research

Matthew L. Leavitt; Ari Morcos

TL;DR

This paper argues that interpretability research on deep neural networks often relies on intuition rather than falsifiable testing, risking misleading conclusions. It introduces a framework for strongly falsifiable interpretability research, analyzes saliency and single-unit methods as case studies of methodological pitfalls, and proposes concrete best practices to improve rigor. Through examples such as ablations and distributed representations, the authors demonstrate how to strengthen hypotheses from weak to strong, including through causal tests and rigorous baselines. The work aims to shift interpretability toward robust, evidence-based methods that reveal genuine mechanisms in deep networks.

Abstract

Methods for understanding the decisions of, and mechanisms underlying, deep neural networks (DNNs) typically rely on building intuition by emphasizing sensory or semantic features of individual examples. For instance, methods aim to visualize the components of an input which are "important" to a network's decision, or to measure the semantic properties of single neurons. Here, we argue that interpretability research suffers from an over-reliance on intuition-based approaches that risk (and in some cases have caused) illusory progress and misleading conclusions. We identify a set of limitations that we argue impede meaningful progress in interpretability research, and examine two popular classes of interpretability methods, saliency and single-neuron-based approaches, that serve as case studies for how over-reliance on intuition and lack of falsifiability can undermine interpretability research. To address these impediments, we propose a framework for strongly falsifiable interpretability research. We encourage researchers to use their intuitions as a starting point to develop and test clear, falsifiable hypotheses, and hope that our framework yields robust, evidence-based interpretability methods that generate meaningful advances in our understanding of DNNs.

Paper Structure

Table of Contents