Towards Quantifying Commonsense Reasoning with Mechanistic Insights

Abhinav Joshi, Areeb Ahmad, Divyaksh Shukla, Ashutosh Modi

TL;DR

This paper tackles the challenge of evaluating commonsense reasoning in language models by introducing a graph-based resource, derived from crowdsourced DeScript event sequences, that models 37 real-world activities. It constructs directed Compact Graphs from which a very large number of reasoning prompts can be sampled (approximately $10^{17}$ queries per activity) and introduces trajectory entropy as a complexity measure, coupled with human data-quality checks. A mechanistic-insights toolkit leverages MCQA prompts, direct-effect path patching, and conjugate prompts within a $do$-calculus framework to localize decision-making in transformer layers. Evaluations across six open-weight LLMs reveal that smaller models can rival larger ones under certain prompts, while localization analyses consistently point to specific layers (around $l \approx 20$ to $26$) as carrying the core decision-making signals. The work provides a scalable framework for rigorous evaluation and circuit discovery of commonsense reasoning, with open-source data and code, though it notes limitations in scope (37 activities) and challenges in generalizing to in-the-wild tasks.

Abstract

Commonsense reasoning deals with the implicit knowledge that is well understood by humans and typically acquired via interactions with the world. In recent times, commonsense reasoning and understanding of various LLMs have been evaluated using text-based tasks. In this work, we argue that a proxy of this understanding can be maintained as a graphical structure that can further help to perform a rigorous evaluation of commonsense reasoning abilities about various real-world activities. We create an annotation scheme for capturing this implicit knowledge in the form of a graphical structure for 37 daily human activities. We find that the created resource can be used to frame an enormous number of commonsense queries ($\sim 10^{17}$), facilitating rigorous evaluation of commonsense reasoning in LLMs. Moreover, recently, the remarkable performance of LLMs has raised questions about whether these models are truly capable of reasoning in the wild and, in general, how reasoning occurs inside these models. In this resource paper, we bridge this gap by proposing design mechanisms that facilitate research in a similar direction. Our findings suggest that the reasoning components are localized in LLMs that play a prominent role in decision-making when prompted with a commonsense query.

Paper Structure

This paper contains 15 sections, 8 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: Quantifying commonsense reasoning in Large Language Models (LLMs).
  • Figure 2: The figure provides an overview of the proposed resource. Real-world activities (well understood by humans) are used to capture commonsense knowledge about these activities from crowdsourced human workers. The resulting event sequence descriptions (ESDs) are used to create a graphical representation of these activities and the underlying commonsense knowledge. The graphical representations make it possible to create an enormous number of commonsense queries ($\sim 10^{17}$ queries per activity). The created resource of commonsense queries is reverified via data quality checks from humans. The overall flexibility attained using the graphical representations helps tease apart the reasoning mechanisms of LLMs, creating a tool for mechanistic insights into commonsense reasoning.
  • Figure 3: The figure illustrates the computation of the direct effect via path patching. (a) The clean prompt ($x_{i<t}$) is passed through the model, and all intermediate states are saved. (b) A second pass uses a conjugate prompt (${\color{red}\bar{x}}_{i<t}$) that flips the expected behavior of the model from the green option to the black option. (c) A third run computes the direct effect by path patching $f_{\theta_l}$, i.e., the green signal from the clean run is patched into the conjugate run. The change in logit values helps localize the decision-making component that plays a vital role in the model selecting green as the correct choice.
  • Figure 4: Success rates of different models compared across the number of shots of in-context examples.
  • Figure 5: The figure shows the direct effect of path patching from the clean run to the conjugate run ('going bowling'), leading to deviations starting at layer 20 and increased signal strength at layer 26, highlighting the role of particular layers in commonsense reasoning.
  • ...and 14 more figures
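
The three runs described in the Figure 3 caption can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a small residual network in which every layer adds its output to a running state, and a linear unembedding that maps the final state to two answer logits. The names `run_with_cache` and `direct_effects`, the random weights, and the two-option setup are all hypothetical. For each layer $l$, the clean-run output of that layer replaces the conjugate-run output only on the direct path to the logits, and the resulting change in the logit difference is recorded.

```python
import numpy as np


def run_with_cache(x, weights):
    """Run a toy residual network and cache each layer's additive output.

    Assumption: layer l contributes tanh(W_l @ h) to the running state h.
    """
    h = x.copy()
    outputs = []
    for W in weights:
        out = np.tanh(W @ h)
        outputs.append(out)
        h = h + out
    return h, outputs


def direct_effects(x_clean, x_conj, weights, unembed, correct, wrong):
    """Per-layer direct effect of patching clean outputs into the conjugate run."""
    _, out_clean = run_with_cache(x_clean, weights)      # (a) clean run
    final_conj, out_conj = run_with_cache(x_conj, weights)  # (b) conjugate run

    def logit_diff(h):
        logits = unembed @ h
        return logits[correct] - logits[wrong]

    base = logit_diff(final_conj)
    effects = []
    for l in range(len(weights)):
        # (c) Swap only layer l's direct contribution to the final state;
        # downstream layers keep their conjugate-run outputs.
        patched = final_conj - out_conj[l] + out_clean[l]
        effects.append(logit_diff(patched) - base)
    return np.array(effects)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.normal(scale=0.3, size=(8, 8)) for _ in range(4)]
    unembed = rng.normal(size=(2, 8))
    x_clean = rng.normal(size=8)
    x_conj = rng.normal(size=8)
    for l, e in enumerate(direct_effects(x_clean, x_conj, weights, unembed, 0, 1)):
        print(f"layer {l}: direct effect on logit difference = {e:+.4f}")
```

In this sketch, a layer with a large positive value is one whose clean-run output pushes the logits toward the correct option when routed directly to the output. With a real transformer, the cached outputs would come from model hooks rather than this toy forward pass.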