Examining Two Hop Reasoning Through Information Content Scaling
David Johnston, Nora Belrose
TL;DR
This paper asks why latent two-hop reasoning is hard for transformers and introduces information content scaling as a quantitative interpretability tool. The authors build synthetic two-hop QA datasets, train transformers of varying sizes with muP, and measure capacity from dataset entropy and effective losses, comparing the measurements against three candidate algorithms: recurrent, two-function, and independent memorization. Latent two-hop QA is best explained by a two-function composition strategy, in which each fact is learned twice, with capacity near $2$ bits per parameter. Chain-of-thought reasoning greatly improves efficiency, while probing methods provide weaker signals. Information content scaling can therefore complement traditional interpretability techniques, though applying it broadly faces practical challenges and some results depend on dataset and hyperparameter choices. Overall, the study clarifies the algorithmic distinctions in two-hop reasoning and the relationship between capacity, generalization, and interpretability in transformers.
Abstract
Prior work has found that transformers have an inconsistent ability to learn to answer latent two-hop questions, that is, questions of the form "Who is Bob's mother's boss?" We study why this is the case by examining how transformers' capacity to learn datasets of two-hop questions and answers (two-hop QA) scales with their size, motivated by prior work on transformer knowledge capacity for simple factual memorization. We find that capacity scaling and generalization both support the hypothesis that latent two-hop QA requires transformers to learn each fact twice, while two-hop QA with chain of thought does not. We also show that, with appropriate dataset parameters, it is possible to "trap" very small models in a regime where they memorize answers to two-hop questions independently, even though they would perform better if they could learn to answer them with function composition. Our findings show that measurement of capacity scaling can complement existing interpretability methods, though there are challenges in using it for this purpose.
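
To make the capacity argument concrete, the following is a minimal sketch of how the information content of a two-hop dataset might be counted under three storage strategies. It is an illustration under stated assumptions, not the paper's actual accounting: it assumes every relation is a uniformly random map from entities to entities, so each fact costs log2(num_entities) bits, and the function names, strategy labels, and the default of 2 bits per parameter (taken from the summary above) are hypothetical choices made for this example.

```python
import math


def fact_bits(num_entities: int, num_relations: int) -> float:
    """Bits needed to store every one-hop fact exactly once.

    Assumption: each relation is a uniformly random map from entities to
    entities, so a single fact carries log2(num_entities) bits.
    """
    return num_relations * num_entities * math.log2(num_entities)


def two_hop_bits(num_entities: int, num_relations: int, strategy: str) -> float:
    """Bits a model would have to store under a hypothetical strategy."""
    once = fact_bits(num_entities, num_relations)
    if strategy == "facts_once":
        # Each fact is stored once and reused for both hops.
        return once
    if strategy == "facts_twice":
        # Each fact is stored twice, once per hop position.
        return 2 * once
    if strategy == "independent":
        # One memorized answer per (entity, relation, relation) question.
        num_questions = num_entities * num_relations ** 2
        return num_questions * math.log2(num_entities)
    raise ValueError(f"unknown strategy: {strategy}")


def params_needed(bits: float, bits_per_param: float = 2.0) -> float:
    """Parameters required at an assumed capacity in bits per parameter."""
    return bits / bits_per_param


if __name__ == "__main__":
    for name in ("facts_once", "facts_twice", "independent"):
        bits = two_hop_bits(1024, 4, name)
        print(f"{name}: {bits:.0f} bits, {params_needed(bits):.0f} params")
```

Under these assumptions, with 1024 entities and 4 relations, storing each fact once costs 40960 bits, storing each fact twice costs 81920 bits, and independent memorization costs 163840 bits. The gap between the strategies widens as the number of relations grows, which is why measured capacity scaling can, in principle, distinguish which strategy a model is using.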
