Interpretable Next-token Prediction via the Generalized Induction Head
Eunji Kim, Sriya Mantena, Weiwei Yang, Chandan Singh, Sungroh Yoon, Jianfeng Gao
TL;DR
The paper addresses the trade-off between interpretability and predictive performance in next-token prediction. It introduces the Generalized Induction-Head Model (GIM), an interpretable, retrieval-based module that operates entirely within the input context, using exact n-gram matching and fuzzy matching to suggest next tokens. Integrated with Infini-gram for language modeling, GIM raises next-token accuracy by up to 25 percentage points over an interpretable baseline, narrowing the gap to black-box LLMs. Paired with linear models for fMRI prediction, it improves neural response prediction by about 20% over the best interpretable baseline. These results indicate that mechanistically inspired, auditable components can deliver meaningful performance gains across domains while preserving transparency, providing token-level attributions and neuroscience insights. The work also discusses practical considerations, such as thresholding by effective context size and the efficiency of fuzzy matching, and suggests extending interpretability through hybrid decoding and applications to other sequential domains.
Abstract
While large transformer models excel in predictive performance, their lack of interpretability restricts their usefulness in high-stakes domains. To remedy this, we propose the Generalized Induction-Head Model (GIM), an interpretable model for next-token prediction inspired by the observation of "induction heads" in LLMs. GIM is a retrieval-based module that identifies similar sequences in the input context by combining exact n-gram matching and fuzzy matching based on a neural similarity metric. We evaluate GIM in two settings: language modeling and fMRI response prediction. In language modeling, GIM improves next-token prediction by up to 25 percentage points over interpretable baselines, significantly narrowing the gap with black-box LLMs. In an fMRI setting, GIM improves neural response prediction by 20% and offers insights into the language selectivity of the brain. GIM represents a significant step toward uniting interpretability and performance across domains. The code is available at https://github.com/ejkim47/generalized-induction-head.
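To make the retrieval idea concrete, the following is a minimal Python sketch of induction-style prediction inside a single context. It is not the authors' implementation. The function names, the `min_exact_n` threshold, and the window size are all hypothetical, and the position-wise `toy_similarity` is only a stand-in for the neural similarity metric the abstract mentions. The exact step looks for the longest suffix of the context that also occurs earlier and counts the tokens that followed it; the fuzzy step weights earlier windows by their similarity to the most recent window.

```python
from collections import Counter, defaultdict


def exact_induction_predict(tokens, max_n=4):
    """Find the longest suffix (up to max_n tokens) that also occurs earlier
    in `tokens`, and count the tokens that followed those earlier occurrences.
    Returns (matched_length, Counter of next tokens)."""
    for n in range(min(max_n, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        counts = Counter()
        # Only start positions whose match is followed by a token in the context.
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == suffix:
                counts[tokens[i + n]] += 1
        if counts:
            return n, counts
    return 0, Counter()


def toy_similarity(a, b):
    """Fraction of positions where two windows agree.
    ASSUMPTION: a toy stand-in for a learned neural similarity metric."""
    if not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(b)


def fuzzy_induction_predict(tokens, window=2, similarity=toy_similarity):
    """Weight each earlier window's next token by its similarity to the
    most recent window, then normalize the weights to a distribution."""
    query = tokens[-window:]
    scores = defaultdict(float)
    for i in range(len(tokens) - window):
        weight = similarity(tokens[i:i + window], query)
        if weight > 0:
            scores[tokens[i + window]] += weight
    total = sum(scores.values())
    return {tok: s / total for tok, s in scores.items()} if total else {}


def predict(tokens, min_exact_n=2, window=2):
    """Use exact matching when the matched suffix is long enough,
    otherwise fall back to fuzzy matching.
    ASSUMPTION: `min_exact_n` is a hypothetical threshold for illustration."""
    n, counts = exact_induction_predict(tokens)
    if n >= min_exact_n:
        total = sum(counts.values())
        return {tok: c / total for tok, c in counts.items()}
    return fuzzy_induction_predict(tokens, window=window)


exact_case = predict("the cat sat on the mat . the cat".split())
fuzzy_case = predict("red apple pie , green apple".split())
```

In the first call, the suffix "the cat" occurs earlier in the context followed by "sat", so the exact step answers. In the second, only the single token "apple" matches exactly, which is below the assumed threshold, so the fuzzy step compares "green apple" against earlier windows and finds "red apple" partially similar. Because every prediction is traced to specific positions in the context, each suggestion can be attributed to the tokens that produced it.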
