Linear-LLM-SCM: Benchmarking LLMs for Coefficient Elicitation in Linear-Gaussian Causal Models

Kanta Yamaoka, Sumantrak Mukherjee, Thomas Gärtner, David Antony Selby, Stefan Konigorski, Eyke Hüllermeier, Viktor Bengs, Sebastian Josef Vollmer

TL;DR

This work addresses the challenge of quantitatively estimating causal parameters in continuous domains by introducing Linear-LLM-SCM, a plug-and-play benchmarking framework that elicits linear-SCM coefficients from pre-specified DAGs using local parent–child prompts. The framework decomposes the DAG into node-level parameterization tasks, applies iterative feedback with hard constraints, and compares the elicited coefficients against ground-truth values, reporting distance and ordering metrics $M_1$–$M_4$ to quantify accuracy and robustness. Empirical results across three LLMs (Gemini 2.5 Flash, Llama 3.1 8B, Llama 3.3 70B) and seven real-world DAGs reveal substantial variability and stochasticity in coefficient elicitation. Larger models often deliver better normalized performance on $M_3$ and $M_4$, yet remain sensitive to DAG misspecification and unit perturbations. The framework is open-sourced so that researchers can benchmark their own DAGs and LLMs. The results highlight current limitations in quantitative causal parameterization and motivate future work on non-linear extensions and robustness to structural noise.

Abstract

Large language models (LLMs) have shown potential in identifying qualitative causal relations, but their ability to perform quantitative causal reasoning (estimating effect sizes that parametrize functional relationships) remains underexplored in continuous domains. We introduce Linear-LLM-SCM, a plug-and-play benchmarking framework for evaluating LLMs on linear Gaussian structural causal model (SCM) parametrization when the DAG is given. The framework decomposes a DAG into local parent-child sets and prompts an LLM to produce a regression-style structural equation per node; these equations are aggregated and compared against available ground-truth parameters. Our experiments reveal several challenges in such benchmarking tasks, namely strong stochasticity in the results for some models and susceptibility to DAG misspecification via spurious edges in continuous domains. Across models, we observe substantial variability in coefficient estimates for some settings and sensitivity to structural and semantic perturbations, highlighting current limitations of LLMs as quantitative causal parameterizers. We also open-source the benchmarking framework so that researchers can plug in their own DAGs and any off-the-shelf LLM for evaluation in their domains.
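
The decompose, prompt, aggregate, and compare loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the framework's actual API: the function names, the prompt wording, the toy DAG, the hard-coded "LLM" answers, and the use of mean absolute error as a stand-in for the paper's metrics $M_1$–$M_4$, which are not defined in this summary.

```python
import numpy as np

# Toy DAG as child -> list of parents (hypothetical example, not from the paper).
dag = {"age": [], "exercise": ["age"], "bmi": ["age", "exercise"]}

def local_families(dag):
    """One elicitation task per non-root node: the node and its parents."""
    return {child: sorted(parents) for child, parents in dag.items() if parents}

def build_prompt(child, parents):
    """Hypothetical prompt template; the paper's real prompt is shown in its Figure 1."""
    terms = " + ".join(f"b_{p} * {p}" for p in parents)
    return f"Provide numeric coefficients for: {child} = {terms} + Gaussian noise."

def aggregate(dag, answers):
    """Assemble per-node answers into B, where B[i, j] is the coefficient of edge i -> j."""
    nodes = sorted(dag)
    idx = {name: k for k, name in enumerate(nodes)}
    B = np.zeros((len(nodes), len(nodes)))
    for child, coefs in answers.items():
        for parent, value in coefs.items():
            B[idx[parent], idx[child]] = value
    return nodes, B

def mean_abs_error(dag, nodes, B_hat, B_true):
    """Illustrative distance over the DAG's edges only (not one of M_1..M_4)."""
    idx = {name: k for k, name in enumerate(nodes)}
    edges = [(idx[p], idx[c]) for c, ps in dag.items() for p in ps]
    return float(np.mean([abs(B_hat[i, j] - B_true[i, j]) for i, j in edges]))

families = local_families(dag)
prompts = {child: build_prompt(child, ps) for child, ps in families.items()}

# Stand-ins for parsed LLM replies and for ground-truth parameters (made-up numbers).
llm_answers = {"exercise": {"age": -0.5}, "bmi": {"age": 0.1, "exercise": -0.3}}
ground_truth = {"exercise": {"age": -0.4}, "bmi": {"age": 0.2, "exercise": -0.3}}

nodes, B_hat = aggregate(dag, llm_answers)
_, B_true = aggregate(dag, ground_truth)
err = mean_abs_error(dag, nodes, B_hat, B_true)
print(prompts["bmi"])
print(f"mean absolute coefficient error: {err:.4f}")
```

In a real run, `llm_answers` would come from parsing model replies to each prompt, and repeating the loop over several samples would expose the stochasticity the abstract reports.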

Paper Structure (28 sections, 4 figures, 5 tables, 2 algorithms)

Figures (4)

  • Figure 1: An example of a prompt for a local parent-child structure in a DAG.
  • Figure 2: The DAG structure of cachexia1 from the BnRep repository.
  • Figure 3: The DAG structure of expenditure from the BnRep repository.
  • Figure 4: Inclusion and exclusion flowchart for ground-truth DAGs from the BnRep DAG repository.