Mechanistic?

Naomi Saphra, Sarah Wiegreffe

TL;DR

The paper examines the term "mechanistic interpretability" and its four uses: a narrow and a broad technical definition, and a narrow and a broad cultural one. It traces the historical development from traditional NLP interpretability to a distinct mechanistic movement, highlighting that movement's circuit-centric focus and the term's later broad adoption. The four definitions range from claims about causal mechanisms to any exploration of a model's internals, and the paper chronicles the two LM interpretability communities, their clashes, and their evolving convergence. The authors advocate clearer vocabulary and cross-community collaboration to advance understanding of language models.

Abstract

The rise of the term "mechanistic interpretability" has accompanied increasing interest in understanding neural models -- particularly language models. However, this jargon has also led to a fair amount of confusion. So, what does it mean to be "mechanistic"? We describe four uses of the term in interpretability research. The most narrow technical definition requires a claim of causality, while a broader technical definition allows for any exploration of a model's internals. However, the term also has a narrow cultural definition describing a cultural movement. To understand this semantic drift, we present a history of the NLP interpretability community and the formation of the separate, parallel "mechanistic" interpretability community. Finally, we discuss the broad cultural definition -- encompassing the entire field of interpretability -- and why the traditional NLP interpretability community has come to embrace it. We argue that the polysemy of "mechanistic" is the product of a critical divide within the interpretability community.

Paper Structure

This paper contains 17 sections.