ACCORD: Closing the Commonsense Measurability Gap

François Roewer-Després, Jinyue Feng, Zining Zhu, Frank Rudzicz

TL;DR

ACCORD introduces a formalized, anti-factual framework to disentangle commonsense grounding from reasoning in large language models. By grounding reasoning templates to ConceptNet, constructing acyclic reasoning trees with controlled hops and distractors, and generating anti-factual contexts, ACCORD enables scalable, automated benchmarking of multi-hop commonsense reasoning. Empirical evaluation on CSQA-derived ACCORD_CSQA reveals that state-of-the-art LLMs degrade to random performance as reasoning hops increase, with a pronounced gap between factual and anti-factual performance signaling context unfaithfulness in models trained on w_def data. The framework provides a scalable path to close the commonsense measurability gap, offering reproducible benchmarks and a design that can extend to other datasets while maintaining rigorous control over artifacts and reasoning complexity.

Abstract

We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs -- including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1 -- shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.

Paper Structure (47 sections, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Differentiating types of reasoning. (a) Formal reasoning requires systematically applying formal rules, which must be provided explicitly [huang2022towards]. In this example, the formal rule states that, for any objects $x$, $y$, and $z$, if $x$ is a component part of (and therefore near) $y$ and $y$ is near $z$, then $x$ is near $z$. Notice that the conclusion cannot be formally derived from the facts without this rule. (b) Commonsense relies on prior knowledge (the semantics of "part of" and "spatial") to implicitly fill in knowledge gaps (the omitted rule), which is ill-defined [huang2022towards; davis2023benchmarks]. (c) Formalized commonsense (our work) formalizes the reasoning elements while maintaining implicit knowledge gaps, which is well-defined and enables the automated verification of correctness. (d) Anti-factual formalized commonsense (our work) maintains the same formal reasoning elements and underlying rule, but grounds variables $x$, $y$, and $z$ with implausible objects to prevent LLMs from spuriously parroting the conclusion without having first reasoned through the facts [wu2023reasoning].
  • Figure 2: The ACCORD framework (b-g) applied to a randomly sampled CSQA instance (a). Notice that factual (f) and anti-factual (g) reasoning refer to whether the answer implied by the carefully chosen negation of the statements matches the original factual answer (a). The statements themselves are always anti-factually grounded.
  • Figure 3: Performance of state-of-the-art LLMs on ACCORD$_\text{CSQA}$. Left: Both factual and anti-factual performance degrade rapidly with increasing reasoning hops, which is expected. Middle: Both factual and anti-factual performance increase with increasing distractors, which is unexpected. Right: Disentangling the interaction effect between reasoning hops and distractors to explain the unexpected result in (Middle). Reasoning hops are dominant while distractors' effect is negligible, which explains the reversed trend in (Middle) when marginalizing over reasoning hops. All: Factual significantly outperforms anti-factual, which indicates context unfaithfulness. As a consequence, anti-factual performance drops below random chance when reasoning hops exceed LLM reasoning capacity. Wald standard error bars are with respect to the 93 pairings, not reruns based on random seeds.
  • Figure 4: The ACCORD framework applied to CSQA. Top row: Manual preprocessing of CSQA. Bottom: Fully automated steps based on this preprocessing. (1) Generate all possible reasoning trees. (2) Pair each CSQA instance to all matching trees. (3) Find all $n$-hop reasoning paths to vary the number of reasoning hops and distractors. (4) For each path, duplicate the tree for each answer choice, then anti-factually ground variables. Legend: s, p, t in the abstractions are shorthand for spatial, part_of, and type_of, respectively.
  • Figure 5: Manipulating a pairing template to imply or contradict various answer choices.
  • ...and 5 more figures
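
The formal rule in the Figure 1 caption (if $x$ is part of $y$ and $y$ is near $z$, then $x$ is near $z$) can be illustrated with a small forward-chaining sketch. This is not the authors' released code: the function name `derive_near` and all example objects (keyboard, laptop, volcano, and so on) are hypothetical, chosen only to show that the same rule yields a conclusion whether the variables are grounded plausibly or implausibly.

```python
from itertools import product


def derive_near(part_of, near):
    """Forward-chain the rule: part_of(x, y) and near(y, z) imply near(x, z).

    Both arguments are sets of (subject, object) pairs. Returns every
    near-fact that can be derived, including the ones given as input.
    """
    derived = set(near)
    changed = True
    while changed:
        changed = False
        # Snapshot the derived facts so the set can grow inside the loop.
        for (x, y), (y2, z) in product(part_of, tuple(derived)):
            if y == y2 and (x, z) not in derived:
                derived.add((x, z))
                changed = True
    return derived


# Factual grounding (hypothetical example objects).
factual = derive_near(
    part_of={("keyboard", "laptop")},
    near={("laptop", "desk")},
)

# Anti-factual grounding: same rule, implausible objects (hypothetical).
anti_factual = derive_near(
    part_of={("volcano", "teacup")},
    near={("teacup", "submarine")},
)

# Two reasoning hops: the rule must fire twice to reach the conclusion.
two_hop = derive_near(
    part_of={("key", "keyboard"), ("keyboard", "laptop")},
    near={("laptop", "desk")},
)
```

In the anti-factual case the conclusion cannot be recalled from prior knowledge, so it is reachable only by applying the rule to the stated facts, which is the property the caption highlights.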