Toward Mechanistic Explanation of Deductive Reasoning in Language Models

Davide Maltoni; Matteo Ferrara

TL;DR

This paper tackles the problem of understanding how language models perform deductive reasoning beyond surface statistics. It trains a tiny, non-pretrained decoder-only model with Chain-of-Thought prompting on a symbol-based Horn-clause task and uses mechanistic interpretability tools to reveal its internal circuits. The authors find that induction heads implement rule completion and rule chaining, forming a minimal two-layer mechanism that generalizes to unseen instances. The work demonstrates that symbolic-like rule learning is achievable even by a small LM. It also provides practical interpretability methods, including a truncated pseudoinverse, with implications for scaling to more complex reasoning tasks.
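
The TL;DR lists a truncated pseudoinverse among the interpretability tools. The snippet below is a generic rank-k truncated pseudoinverse in NumPy, not the authors' implementation. It inverts only the k largest singular values, which avoids amplifying the near-zero directions that make a plain pseudoinverse of an ill-conditioned matrix numerically unstable. An example of such a matrix is a learned projection that one wants to approximately invert. The function name `truncated_pinv` and the parameters `k` and `tol` are illustrative assumptions. This section does not say how the paper chooses k or which matrices it inverts.

```python
import numpy as np


def truncated_pinv(W: np.ndarray, k: int, tol: float = 1e-12) -> np.ndarray:
    """Rank-k truncated Moore-Penrose pseudoinverse of W (illustrative sketch).

    Only the k largest singular values are inverted, and never those that are
    numerically zero (below tol * largest). For W of shape (m, n) the result
    has shape (n, m).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # s is sorted descending
    # Cap k so that (near-)zero singular values are treated as zero, not inverted.
    keep = min(k, int(np.sum(s > tol * s[0]))) if s.size else 0
    s_inv = 1.0 / s[:keep]
    # W^+_k = V_k diag(1 / s_k) U_k^T
    return Vt[:keep].T @ np.diag(s_inv) @ U[:, :keep].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 5))
    # With k equal to the full rank, the result matches NumPy's pseudoinverse.
    print(np.allclose(truncated_pinv(W, 5), np.linalg.pinv(W)))  # True
    # With k = 2, the inverse has rank 2 and ignores the weaker directions.
    print(np.linalg.matrix_rank(truncated_pinv(W, 2), tol=1e-8))  # 2
```

When k equals the full rank, the result coincides with `np.linalg.pinv`. Lowering k gives up exactness in exchange for robustness to noisy or weakly expressed directions.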

Abstract

Recent large language models have demonstrated notable capabilities in solving problems that require logical reasoning; however, the internal mechanisms behind these capabilities remain largely unexplored. In this paper, we show that a small language model can solve a deductive reasoning task by learning the underlying rules (rather than operating as a statistical learner). We then provide a low-level explanation of its internal representations and computational circuits. Our findings reveal that induction heads play a central role in implementing the rule completion and rule chaining steps involved in the logical inference required by the task.
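
The abstract attributes rule completion and rule chaining to induction heads. In the standard interpretability account, an induction head implements match-and-copy: it finds an earlier occurrence of the current pattern and copies the token that followed it. It is typically fed by a previous-token head in an earlier layer, which is what allows a two-token key to be matched. The sketch below mimics that behaviour at the symbol level to show how repeated match-and-copy steps yield a deduction chain. It rests on several hypothetical choices:

- the token format `premise > conclusion ;`
- a restriction to single-premise Horn clauses
- the names `induction_complete` and `chain`

None of this is the paper's task encoding or its trained model.

```python
from typing import List, Optional, Tuple


def induction_complete(context: List[str], prefix: List[str],
                       stop: str = ";") -> Optional[List[str]]:
    """Match-and-copy in the spirit of an induction head (hypothetical helper).

    Finds the most recent occurrence of `prefix` in `context` and returns the
    tokens that followed it, up to (not including) `stop`. Returns None if the
    prefix never occurs.
    """
    n = len(prefix)
    for i in range(len(context) - n, -1, -1):
        if context[i:i + n] == prefix:
            copied = []
            j = i + n
            while j < len(context) and context[j] != stop:
                copied.append(context[j])
                j += 1
            return copied
    return None


def chain(rules: str, start: str, goal: str,
          max_steps: int = 10) -> Tuple[List[str], bool]:
    """Chain single-premise rules 'p > c ;' from `start` toward `goal`.

    Each step is a "rule completion": the current fact followed by '>' is the
    match key, and the copied conclusion becomes the next fact ("rule
    chaining"). `max_steps` guards against cyclic rule sets.
    """
    context = rules.split()
    trace = [start]
    for _ in range(max_steps):
        if trace[-1] == goal:
            return trace, True
        conclusion = induction_complete(context, [trace[-1], ">"])
        if not conclusion:
            return trace, False  # no rule has the current fact as its premise
        trace.append(conclusion[0])
    return trace, trace[-1] == goal


if __name__ == "__main__":
    # Hypothetical rule set, deliberately listed out of chaining order.
    rules = "a > b ; c > d ; b > c ;"
    print(chain(rules, "a", "d"))  # (['a', 'b', 'c', 'd'], True)
    print(chain(rules, "d", "a"))  # (['d'], False)
```

Because each lookup is keyed on content rather than position, the rules can appear in any order in the context. This is one intuition for why a match-and-copy mechanism can generalize to unseen instances.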
