ACCESS: A Benchmark for Abstract Causal Event Discovery and Reasoning
Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
TL;DR
ACCESS addresses the gap between surface-level causal event detection and robust abstract causal reasoning by introducing a two-phase benchmark that grounds event abstractions in GLUCOSE and then constructs causal graphs over 725 abstractions. The pipeline blends automatic clustering and human annotation to produce 1,494 causal relations across 9,513 stories, enabling evaluation of both abstraction quality and causal discovery. Experiments show that statistical structure learning struggles on sparse, abstract graphs and that large language models still face challenges in non-contextual pairwise abstraction discovery, but incorporating ACCESS-derived abstract causal graphs significantly boosts QA reasoning in LLMs. The work provides a reproducible pipeline and insights into the necessity of improving abstraction granularity and abstract causal representation learning for robust AI reasoning.
Abstract
Identifying cause-and-effect relationships is critical to understanding real-world dynamics and, ultimately, to causal reasoning. Existing methods for identifying event causality in NLP, including those based on Large Language Models (LLMs), exhibit difficulties in out-of-distribution settings due to the limited scale and heavy reliance on lexical cues within available benchmarks. Modern benchmarks, inspired by probabilistic causal inference, have attempted to construct causal graphs of events as a robust representation of causal knowledge; \texttt{CRAB} \citep{romanou2023crab} is one recent benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a benchmark designed for discovery and reasoning over abstract causal events. Unlike existing resources, \texttt{ACCESS} focuses on the causality of everyday life events at the abstraction level. We propose a pipeline for identifying abstractions for event generalizations from \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit commonsense causal knowledge, from which we subsequently extract $1.4$K causal pairs. Our experiments highlight the ongoing challenges of using statistical methods and/or LLMs for automatic abstraction identification and causal discovery in NLP. Nonetheless, we demonstrate that the abstract causal knowledge provided in \texttt{ACCESS} can be leveraged to enhance QA reasoning performance in LLMs.
