ACCORD: Closing the Commonsense Measurability Gap
François Roewer-Després, Jinyue Feng, Zining Zhu, Frank Rudzicz
TL;DR
ACCORD introduces a formalized, anti-factual framework that disentangles commonsense grounding from reasoning in large language models. It grounds reasoning templates in ConceptNet, constructs acyclic reasoning trees with a controlled number of hops and distractors, and generates anti-factual contexts, which makes benchmarking of multi-hop commonsense reasoning scalable and automated. On ACCORD_CSQA, the benchmark derived from CSQA, state-of-the-art LLMs degrade to random-chance performance as the number of reasoning hops increases, and a pronounced gap between factual and anti-factual performance signals context unfaithfulness in models trained on w_def data. The framework offers a scalable path toward closing the commonsense measurability gap: it provides reproducible benchmarks and a design that can extend to other datasets while keeping rigorous control over artifacts and reasoning complexity.
Abstract
We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical one or two hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs, including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1, shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.
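To make the idea of controlled hops and distractors concrete, the following is a minimal illustrative sketch, not the ACCORD implementation. It builds a single reasoning chain (the simplest acyclic tree) with a chosen number of hops, adds disconnected distractor statements, and then measures the hop count back from the shuffled context. The names `Statement`, `build_chain`, `build_context`, and `count_hops`, as well as the toy entities and relations, are assumptions made for illustration only; ConceptNet grounding and anti-factual generation are not covered here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One relation instance, e.g. (head='a', relation='part_of', tail='b')."""
    head: str
    relation: str
    tail: str


def build_chain(entities, relations, n_hops):
    """Link n_hops + 1 distinct entities into a path of n_hops statements.

    Using each entity at most once guarantees the path has no cycle.
    """
    if len(set(entities)) != len(entities):
        raise ValueError("entities must be distinct")
    if n_hops + 1 > len(entities):
        raise ValueError("not enough entities for the requested hops")
    return [
        Statement(entities[i], relations[i % len(relations)], entities[i + 1])
        for i in range(n_hops)
    ]


def build_context(chain, spare_entities, relation, n_distractors, seed=0):
    """Mix the chain with distractor statements that never touch the chain."""
    on_path = {s.head for s in chain} | {s.tail for s in chain}
    unused = [e for e in spare_entities if e not in on_path]
    if len(unused) < 2 * n_distractors:
        raise ValueError("not enough spare entities for the distractors")
    distractors = [
        Statement(unused[2 * i], relation, unused[2 * i + 1])
        for i in range(n_distractors)
    ]
    context = list(chain) + distractors
    random.Random(seed).shuffle(context)  # hide the reasoning order
    return context


def count_hops(context, source, target):
    """Follow statements from source to target; return the hop count or None."""
    next_of = {s.head: s.tail for s in context}
    node, hops, seen = source, 0, {source}
    while node != target:
        if node not in next_of:
            return None  # dead end: target is unreachable from source
        node = next_of[node]
        if node in seen:
            return None  # defensive check against cycles
        seen.add(node)
        hops += 1
    return hops
```

In this sketch, increasing `n_hops` lengthens the reasoning path a model would need to follow, while `n_distractors` adds statements that are irrelevant to the answer, so the two sources of difficulty can be varied independently.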
