Linear-LLM-SCM: Benchmarking LLMs for Coefficient Elicitation in Linear-Gaussian Causal Models
Kanta Yamaoka, Sumantrak Mukherjee, Thomas Gärtner, David Antony Selby, Stefan Konigorski, Eyke Hüllermeier, Viktor Bengs, Sebastian Josef Vollmer
TL;DR
This work addresses the challenge of quantitatively estimating causal parameters in continuous domains by introducing Linear-LLM-SCM, a plug-and-play benchmarking framework that elicits linear-SCM coefficients from pre-specified DAGs using local parent–child prompts. By decomposing the DAG into node-level parameterization tasks and applying iterative feedback with hard constraints, the framework evaluates LLMs against ground-truth coefficients and reports distance and ordering metrics $M_1$–$M_4$ to quantify accuracy and robustness. Empirical results across three LLMs (Gemini 2.5 Flash, Llama 3.1 8B, Llama 3.3 70B) and seven real-world DAGs reveal substantial variability and stochasticity in coefficient elicitation: larger models often deliver better normalized performance on $M_3$ and $M_4$, yet remain sensitive to DAG misspecification and unit perturbations. The framework is open-sourced so that researchers can benchmark their own DAGs and LLMs; the results highlight current limitations in quantitative causal parameterization and motivate future work on non-linear extensions and robustness to structural noise.
Abstract
Large language models (LLMs) have shown potential in identifying qualitative causal relations, but their ability to perform quantitative causal reasoning -- estimating effect sizes that parametrize functional relationships -- remains underexplored in continuous domains. We introduce Linear-LLM-SCM, a plug-and-play benchmarking framework for evaluating LLMs on linear Gaussian structural causal model (SCM) parametrization when the DAG is given. The framework decomposes a DAG into local parent-child sets and prompts an LLM to produce a regression-style structural equation for each node; the resulting equations are aggregated and compared against available ground-truth parameters. Our experiments reveal several challenges in such benchmarking tasks, namely strong stochasticity in the results of some models and susceptibility to DAG misspecification via spurious edges in continuous domains. Across models, we observe substantial variability in coefficient estimates in some settings and sensitivity to structural and semantic perturbations, highlighting current limitations of LLMs as quantitative causal parameterizers. We also open-source the benchmarking framework so that researchers can plug in their own DAGs and any off-the-shelf LLM to run evaluations in their domains.
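To make the decompose, prompt, aggregate, and compare pipeline described in the abstract concrete, the following is a minimal Python sketch, not the released framework. All names here (`parent_sets`, `build_prompt`, `elicit_coefficients`, `mean_abs_error`, `fake_llm`) are hypothetical, the prompt wording is invented for illustration, the toy DAG and coefficients are made up, and the mean absolute error is a generic stand-in rather than any of the metrics $M_1$–$M_4$. The LLM call is replaced by a stub that returns a constant so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]  # (parent, child)


def parent_sets(nodes: List[str], edges: List[Edge]) -> Dict[str, List[str]]:
    """Decompose a DAG into one local parent set per node."""
    parents: Dict[str, List[str]] = {n: [] for n in nodes}
    for parent, child in edges:
        parents[child].append(parent)
    return parents


def build_prompt(child: str, parents: List[str]) -> str:
    """Build a local, regression-style prompt for one node (wording is illustrative)."""
    terms = " + ".join(f"b_{p} * {p}" for p in parents)
    return f"Give numeric coefficients for the linear equation {child} = {terms} + noise."


def elicit_coefficients(
    nodes: List[str],
    edges: List[Edge],
    ask: Callable[[str, str, List[str]], Dict[str, float]],
) -> Dict[Edge, float]:
    """Query `ask` once per non-root node and aggregate answers into edge coefficients."""
    coeffs: Dict[Edge, float] = {}
    for child, parents in parent_sets(nodes, edges).items():
        if not parents:
            continue  # root nodes have no structural coefficients to elicit
        answer = ask(build_prompt(child, parents), child, parents)
        for p in parents:
            coeffs[(p, child)] = float(answer[p])
    return coeffs


def mean_abs_error(est: Dict[Edge, float], truth: Dict[Edge, float]) -> float:
    """Generic distance over shared edges (a stand-in, not one of the paper's metrics)."""
    shared = est.keys() & truth.keys()
    if not shared:
        raise ValueError("no edges in common between estimate and ground truth")
    return sum(abs(est[e] - truth[e]) for e in shared) / len(shared)


def fake_llm(prompt: str, child: str, parents: List[str]) -> Dict[str, float]:
    """Stub standing in for an LLM call plus answer parsing: always answers 0.5."""
    return {p: 0.5 for p in parents}


# Toy DAG with made-up ground-truth coefficients, for illustration only.
nodes = ["age", "bmi", "bp"]
edges = [("age", "bmi"), ("age", "bp"), ("bmi", "bp")]
truth = {("age", "bmi"): 0.5, ("age", "bp"): 1.0, ("bmi", "bp"): 0.0}

estimated = elicit_coefficients(nodes, edges, fake_llm)
error = mean_abs_error(estimated, truth)
print(estimated, error)
```

In practice `fake_llm` would be swapped for a function that sends the prompt to a model and parses its reply; keeping that behind a single callable is one way to make the model choice plug-and-play, though the actual framework may structure this differently.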
