Automatically Finding Rule-Based Neurons in OthelloGPT

Aditya Singh, Zihang Wen, Srujananjali Medicherla, Adam Karvonen, Can Rager

TL;DR

The paper studies the interpretability of a transformer trained to predict legal moves in Othello by searching for rule-based explanations of individual neurons. It introduces an automated framework that trains regression and binary decision trees to predict neuron activations from board-state features, extracts the high-activation paths as disjunctive normal form (DNF) rules, and surfaces the neurons that implement a given game query. Empirically, roughly half of the layer-5 neurons are well described by compact rule trees ($R^2 > 0.7$ for 913 of 2,048 neurons), and causal interventions show that ablating the neurons tied to a pattern degrades legal-move prediction along that pattern approximately 5-10 times more than along control patterns. An open-source Python tool maps rule-based game behaviors to their implementing neurons, providing a reproducible benchmark for testing interpretability methods against known ground-truth structure in a real domain.

Abstract

OthelloGPT, a transformer trained to predict valid moves in Othello, provides an ideal testbed for interpretability research. The model is complex enough to exhibit rich computational patterns, yet grounded in rule-based game logic that enables meaningful reverse-engineering. We present an automated approach based on decision trees to identify and interpret MLP neurons that encode rule-based game logic. Our method trains regression decision trees to map board states to neuron activations, then extracts decision paths where neurons are highly active to convert them into human-readable logical forms. These descriptions reveal highly interpretable patterns; for instance, neurons that specifically detect when diagonal moves become legal. Our findings suggest that roughly half of the neurons in layer 5 can be accurately described by compact, rule-based decision trees ($R^2 > 0.7$ for 913 of 2,048 neurons), while the remainder likely participate in more distributed or non-rule-based computations. We verify the causal relevance of patterns identified by our decision trees through targeted interventions. For a specific square, for specific game patterns, we ablate neurons corresponding to those patterns and find an approximately 5-10 fold stronger degradation in the model's ability to predict legal moves along those patterns compared to control patterns. To facilitate future work, we provide a Python tool that maps rule-based game behaviors to their implementing neurons, serving as a resource for researchers to test whether their interpretability methods recover meaningful computational structures.
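
The pipeline described in the abstract (fit a regression tree from board features to a neuron's activation, then read the high-activation paths off as logical rules) can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the feature names, the toy "neuron", the 0.5 activation threshold, and the tiny greedy tree learner are all assumptions made for illustration. In practice a library implementation such as scikit-learn's `DecisionTreeRegressor` would be the more natural choice.

```python
from itertools import product
from statistics import mean

# Hypothetical binary board features along one diagonal (illustrative names).
FEATURES = ["target_empty", "diag_neighbor_theirs", "diag_far_mine",
            "unrelated_square_mine"]

def sse(ys):
    """Sum of squared errors of ys around their mean."""
    m = mean(ys)
    return sum((v - m) ** 2 for v in ys)

def fit_tree(X, y, depth):
    """Greedy regression tree over binary features, returned as nested dicts."""
    node = {"value": mean(y)}
    if depth == 0 or sse(y) == 0:
        return node
    best_cost, best_f = sse(y), None
    for f in range(len(X[0])):
        lo = [v for x, v in zip(X, y) if x[f] == 0]
        hi = [v for x, v in zip(X, y) if x[f] == 1]
        # Keep the split only if it strictly reduces the squared error.
        if lo and hi and sse(lo) + sse(hi) < best_cost:
            best_cost, best_f = sse(lo) + sse(hi), f
    if best_f is None:
        return node
    node["feature"] = best_f
    for bit, key in ((0, "lo"), (1, "hi")):
        rows = [(x, v) for x, v in zip(X, y) if x[best_f] == bit]
        node[key] = fit_tree([r[0] for r in rows], [r[1] for r in rows], depth - 1)
    return node

def predict(node, x):
    """Follow the tree down to a leaf and return its mean activation."""
    while "feature" in node:
        node = node["hi"] if x[node["feature"]] else node["lo"]
    return node["value"]

def r2(tree, X, y):
    """Coefficient of determination of the tree on (X, y)."""
    residual = sum((v - predict(tree, x)) ** 2 for x, v in zip(X, y))
    return 1 - residual / sse(y)

def high_paths(node, threshold, path=()):
    """Collect root-to-leaf paths whose leaf activation is >= threshold."""
    if "feature" not in node:
        return [list(path)] if node["value"] >= threshold else []
    name = FEATURES[node["feature"]]
    return (high_paths(node["lo"], threshold, path + ("NOT " + name,))
            + high_paths(node["hi"], threshold, path + (name,)))

def to_dnf(paths):
    """Render paths as a disjunction of conjunctions."""
    return " OR ".join("(" + " AND ".join(p) + ")" for p in paths)

# Toy neuron (assumption): fires only when a diagonal capture makes the move legal.
X = [list(bits) for bits in product([0, 1], repeat=len(FEATURES))]
y = [1.0 if x[0] and x[1] and x[2] else 0.0 for x in X]

tree = fit_tree(X, y, depth=3)
rule = to_dnf(high_paths(tree, threshold=0.5))
score = r2(tree, X, y)
print(rule)
print(score)
```

On this toy data the tree recovers the three-literal conjunction and ignores the unrelated feature. A neuron with $R^2$ above a chosen cutoff (0.7 in the paper) would count as rule-based under this kind of analysis.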

Paper Structure

This paper contains 27 sections, 3 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Overview of neuron interpretation pipeline. Board state features are used as training data for decision trees to predict neuron activations (A, B). Note: small circles in A and B are all possible legal moves. High activation decision paths are then extracted to obtain logical rules for each neuron (C, D). We provide an interactive visualization of decision trees in this Colab notebook: https://colab.research.google.com/drive/1kKLj9c3elB0yjJoBZnkHqFl3ZWygP6zw
  • Figure 2: Comparison of decision trees to other baselines. A: We evaluate regression methods on $R^2$ scores. B: We evaluate classification methods on $F_1$ scores. C, D: We show the number of interpretable rule-based neurons with score cutoffs of 0.7, 0.8, and 0.9.
  • Figure 3: We evaluate all methods on two contrastive probe feature metrics. A: containment metric. B: Jaccard metric.
  • Figure 4: Layer-wise rule-based neuron (cutoff of 0.7) intervention. A: Accuracy of intervention. B: KL divergence between before and after intervention. The dashed lines show the ablation of board-state related layers ([0,1,2,3,4]) and valid-move related layers ([5,6]) from the previous literature.
  • Figure 5: Fine-grained intervention. A: Intervention pattern example: valid move via diagonal pattern. B: Control pattern example: valid move via diagonal pattern. Note: small circles in A and B are all possible legal moves. C, D: Output probabilities of OthelloGPT on a game example where H3 is legal via diagonal pattern. C: Before intervention. D: After intervention, the probability for the intervened square H3 drops to a value between the legal and illegal square probabilities. Run more interventions in the Colab notebook: https://colab.research.google.com/drive/1kKLj9c3elB0yjJoBZnkHqFl3ZWygP6zw
  • ...and 6 more figures
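
The captions for Figures 3 and 4 mention a Jaccard metric and a KL divergence between model outputs before and after an intervention. As a rough sketch of the textbook definitions only: the direction of the KL divergence, the example probabilities, and the example square sets below are assumptions, and how the paper constructs the compared sets is not stated in this excerpt.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same moves.

    eps guards against log(0) when q assigns zero probability to a move.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical move distributions over four squares.
before = [0.5, 0.5, 0.0, 0.0]      # before ablation: mass on two legal moves
after = [0.25, 0.25, 0.25, 0.25]   # after ablation: mass spread uniformly
kl = kl_divergence(before, after)

# Hypothetical feature sets, e.g. squares named by a rule vs. by a probe.
overlap = jaccard({"C3", "D4", "E5"}, {"D4", "E5", "F6"})
print(kl, overlap)
```

A larger KL value means the intervention moved the output distribution further from the original one; a Jaccard value of 1 means the two sets are identical.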