CausalEval: Towards Better Causal Reasoning in Language Models
Longxuan Yu, Delin Chen, Siheng Xiong, Qingyang Wu, Qingzhen Liu, Dawei Li, Zhikai Chen, Xiaoze Liu, Liangming Pan
TL;DR
CausalEval surveys methods for enhancing causal reasoning (CR) in large language models (LLMs) and empirically evaluates a range of models and enhancement strategies. It classifies approaches by whether LLMs act as direct causal reasoning engines or as helpers for traditional CR methods, and provides a detailed taxonomy, evaluation framework, and benchmark-driven insights. The study finds that current LLMs exhibit shallow causal reasoning with a notable gap to human performance, though prompting strategies, model scaling, and tool integration yield gains. It identifies data efficiency, interpretability, and hybrid symbolic-statistical approaches as key future directions, and highlights the need for diverse, standardized benchmarks and robust data generation to better evaluate and advance LLMs' causal reasoning capabilities across domains.
Abstract
Causal reasoning (CR) is a crucial aspect of intelligence, essential for problem-solving, decision-making, and understanding the world. While language models (LMs) can generate rationales for their outputs, their ability to reliably perform causal reasoning remains uncertain, and they often fall short on tasks requiring a deep understanding of causality. In this paper, we introduce CausalEval, a comprehensive review of research aimed at enhancing LMs for causal reasoning, coupled with an empirical evaluation of current models and methods. We categorize existing methods by the role LMs play, either as reasoning engines or as helpers providing knowledge or data to traditional CR methods, and discuss the methodologies in each category in detail. We then assess the performance of current LMs and various enhancement methods on a range of causal reasoning tasks, providing key findings and in-depth analysis. Finally, we present insights from current studies and highlight promising directions for future research. We aim for this work to serve as a comprehensive resource, fostering further advancements in causal reasoning with LMs.
