Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Francesco Ortu, Zhijing Jin, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga, Bernhard Schölkopf
TL;DR
This work introduces the competition of mechanisms framework to study how multiple internal mechanisms in LLMs interact to determine the final prediction, focusing on factual knowledge recall versus counterfactual redefinition. It uses two interpretability tools, logit inspection of the residual stream and attention modification, to locate where and how the competition occurs, from macroscopic layer-level dynamics down to microscopic attention-head effects. The study finds that late-layer attention blocks and a handful of specialized heads predominantly control the competition, and that modifying a few localized attention entries can substantially shift outcomes toward factual recall. By experimentally perturbing attention and analyzing word-choice similarities, the authors demonstrate both that the mechanisms can be manipulated and that predictions are sensitive to prompt structure, offering insights for interpretability and safety in LLMs.
Abstract
Interpretability research aims to bridge the gap between empirical success and our scientific understanding of the inner workings of large language models (LLMs). However, most existing research focuses on analyzing a single mechanism, such as how models copy or recall factual knowledge. In this work, we propose a formulation of competition of mechanisms, which focuses on the interplay of multiple mechanisms instead of individual mechanisms and traces how one of them becomes dominant in the final prediction. We uncover how and where mechanisms compete within LLMs using two interpretability methods: logit inspection and attention modification. Our findings show traces of the mechanisms and their competition across various model components and reveal attention positions that effectively control the strength of certain mechanisms. Code: https://github.com/francescortu/comp-mech. Data: https://huggingface.co/datasets/francescortu/comp-mech.
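The two methods named above can be illustrated with a minimal NumPy sketch on toy arrays. This is not the authors' implementation (see the linked repository for that). The function names, array shapes, token indices, and scaling factor below are all assumptions made for illustration. Logit inspection is approximated here as projecting a residual-stream vector through an unembedding matrix and reading off the logits of the factual and counterfactual tokens; the final layer normalization that real models apply before unembedding is omitted. Attention modification is approximated as multiplying a single attention entry by a factor, without renormalizing the row.

```python
import numpy as np


def logit_inspection(residual, unembed, token_ids):
    """Project a residual-stream vector onto the vocabulary and
    return the logits of the named tokens.

    residual:  (d_model,) hidden state at the last position of some layer
    unembed:   (d_model, vocab_size) unembedding matrix
    token_ids: dict mapping a label to a vocabulary index (hypothetical ids)
    """
    logits = residual @ unembed
    return {label: float(logits[idx]) for label, idx in token_ids.items()}


def scale_attention_entry(attn, query_pos, key_pos, alpha):
    """Return a copy of an attention matrix with one entry multiplied
    by alpha. The row is not renormalized (an assumption of this sketch)."""
    modified = attn.copy()
    modified[query_pos, key_pos] *= alpha
    return modified


# Toy setup: d_model = 3, vocabulary of 4 tokens.
residual = np.array([1.0, 2.0, -1.0])
unembed = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
token_ids = {"factual": 1, "counterfactual": 2}  # hypothetical indices

scores = logit_inspection(residual, unembed, token_ids)
# The mechanism whose token has the larger logit dominates at this layer.
dominant = max(scores, key=scores.get)

# Toy causal attention pattern for one head over 3 positions.
attn = np.array([
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
])
# Strengthen the attention from the last position to position 1.
modified = scale_attention_entry(attn, query_pos=2, key_pos=1, alpha=5.0)
```

In a real model the modified attention pattern would be substituted back into the forward pass, and the logits of the two candidate tokens would be compared before and after the change.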
