Towards Quantifying Commonsense Reasoning with Mechanistic Insights
Abhinav Joshi, Areeb Ahmad, Divyaksh Shukla, Ashutosh Modi
TL;DR
This paper tackles the challenge of evaluating commonsense reasoning in language models by introducing a graph-based resource, derived from crowdsourced DeScript event sequences, that models 37 real-world activities. It constructs directed Compact Graphs from which a very large pool of reasoning prompts can be sampled (approximately $10^{17}$ queries per activity), and it introduces trajectory entropy as a complexity measure, coupled with human quality checks. A mechanistic-insights toolkit uses MCQA prompts, direct-effect path patching, and conjugate prompts within a $do$-calculus framework to localize decision-making in transformer layers. Evaluations across six open-weight LLMs show that smaller models can rival larger ones under certain prompts, while localization analyses consistently point to specific layers (roughly $l = 20$ to $26$) as carrying the core decision-making signals. The work provides a scalable framework for rigorous evaluation and circuit discovery of commonsense reasoning, with open-source data and code, though it notes limitations in scope (37 activities) and challenges in generalizing to in-the-wild tasks.
Abstract
Commonsense reasoning deals with implicit knowledge that is well understood by humans and typically acquired through interactions with the world. Recently, the commonsense reasoning and understanding abilities of various LLMs have been evaluated using text-based tasks. In this work, we argue that a proxy for this understanding can be maintained as a graphical structure, which in turn supports a rigorous evaluation of commonsense reasoning about various real-world activities. We create an annotation scheme for capturing this implicit knowledge in the form of a graphical structure for 37 daily human activities. We find that the created resource can be used to frame an enormous number of commonsense queries ($\sim 10^{17}$), facilitating rigorous evaluation of commonsense reasoning in LLMs. Moreover, the remarkable performance of LLMs has recently raised questions about whether these models are truly capable of reasoning in the wild and, more generally, how reasoning occurs inside these models. In this resource paper, we address these questions by proposing design mechanisms that facilitate research in this direction. Our findings suggest that LLMs contain localized reasoning components that play a prominent role in decision-making when the model is prompted with a commonsense query.
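To give some intuition for why a compact graph of events can yield so many distinct queries, the following minimal sketch counts the start-to-end trajectories in a small directed acyclic graph and computes an entropy over them. Everything here is an assumption made for illustration: the toy graph, the node names, the function names, and in particular the entropy, which is the Shannon entropy of a random walk that picks each outgoing edge uniformly. It is not claimed to match the paper's definition of trajectory entropy or its actual graphs.

```python
from functools import lru_cache
from math import log2

# Hypothetical toy graph for one activity; node names are illustrative only.
GRAPH = {
    "start": ["get ingredients", "preheat oven"],
    "get ingredients": ["mix batter"],
    "preheat oven": ["mix batter"],
    "mix batter": ["pour into pan"],
    "pour into pan": ["bake"],
    "bake": ["frost cake", "serve plain"],
    "frost cake": ["end"],
    "serve plain": ["end"],
    "end": [],
}


def count_trajectories(graph, source, sink):
    """Count distinct directed paths from source to sink in a DAG."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == sink:
            return 1
        # Sum the path counts of all successors (0 for a dead end).
        return sum(paths_from(nxt) for nxt in graph[node])

    return paths_from(source)


def random_walk_entropy(graph, source, sink):
    """Entropy in bits of the trajectory distribution induced by a random
    walk that chooses uniformly among outgoing edges (an assumed model)."""

    @lru_cache(maxsize=None)
    def entropy_from(node):
        successors = graph[node]
        if node == sink or not successors:
            return 0.0
        # Chain rule: entropy of the first step plus the expected
        # entropy of the remainder of the trajectory.
        step_entropy = log2(len(successors))
        rest = sum(entropy_from(nxt) for nxt in successors) / len(successors)
        return step_entropy + rest

    return entropy_from(source)


n_paths = count_trajectories(GRAPH, "start", "end")
entropy_bits = random_walk_entropy(GRAPH, "start", "end")
print(n_paths, entropy_bits)
```

In this toy graph, two independent binary choices give four trajectories and two bits of entropy. The count multiplies with every additional branching point, which is the kind of combinatorial growth that lets a modest number of annotated events support a very large number of queries.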
