Toward Mechanistic Explanation of Deductive Reasoning in Language Models
Davide Maltoni, Matteo Ferrara
TL;DR
This paper asks how language models perform deductive reasoning beyond surface statistics. It trains a tiny, non-pretrained decoder-only model with Chain-of-Thought prompting on a symbol-based Horn-clause task and uses mechanistic interpretability tools to reveal its internal circuits. The authors find that induction heads implement rule completion and rule chaining, forming a minimal two-layer mechanism that generalizes to unseen instances. The work shows that LMs can learn symbolic-like rules and provides practical interpretability methods, including a truncated pseudoinverse, with implications for scaling to more complex reasoning tasks.
Abstract
Recent large language models have demonstrated notable capabilities in solving problems that require logical reasoning; however, the internal mechanisms behind these capabilities remain largely unexplored. In this paper, we show that a small language model can solve a deductive reasoning task by learning the underlying rules (rather than operating as a statistical learner). We then provide a low-level explanation of its internal representations and computational circuits. Our findings reveal that induction heads play a central role in implementing the rule completion and rule chaining steps involved in the logical inference required by the task.
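
To make the task concrete, the following is a minimal symbolic sketch of deduction over Horn clauses, assuming a rule is a set of body symbols plus a single head symbol. It is a reference procedure for the kind of inference the task requires, not the authors' dataset format and not the mechanism the model learns; the names `Rule` and `forward_chain` are hypothetical. On one plausible reading of the abstract's terms, firing a single rule whose body is satisfied corresponds to rule completion, and feeding the derived head into later rules corresponds to rule chaining.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical representation: a Horn clause is (body symbols, head symbol),
# read as "if every body symbol holds, then the head symbol holds".
Rule = Tuple[FrozenSet[str], str]


def forward_chain(
    rules: List[Rule], facts: Iterable[str], goal: str
) -> Tuple[bool, List[Tuple[List[str], str]]]:
    """Derive new symbols until the goal is reached or nothing more fires.

    Returns (goal_proved, trace), where trace lists the fired rules in order,
    loosely analogous to a chain-of-thought over intermediate conclusions.
    """
    known: Set[str] = set(facts)
    trace: List[Tuple[List[str], str]] = []
    changed = True
    while changed and goal not in known:
        changed = False
        for body, head in rules:
            # Fire a rule only if its whole body is known and its head is new.
            if head not in known and body <= known:
                known.add(head)
                trace.append((sorted(body), head))
                changed = True
                if goal in known:
                    break
    return goal in known, trace


rules: List[Rule] = [
    (frozenset({"a", "b"}), "c"),
    (frozenset({"c"}), "d"),
    (frozenset({"e"}), "f"),
]

proved, trace = forward_chain(rules, {"a", "b"}, "d")
for body, head in trace:
    print(", ".join(body), "->", head)
print("goal proved:", proved)
```

With the facts `a` and `b`, the goal `d` is reached in two steps: the first rule yields `c`, and the second rule then uses `c` to yield `d`. The goal `f` cannot be derived from the same facts because `e` is never established.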
