Prompting or Fine-tuning? Exploring Large Language Models for Causal Graph Validation

Yuni Susanti, Nina Holsmoelle

TL;DR

This paper assesses whether large language models (LLMs) can verify causality in graphs produced by statistical causal discovery, by checking whether a causal link between each variable pair is supported by textual context. It compares two paradigms, prompt-based zero-/few-shot LLMs and fine-tuned language models (BERT and GPT), across biomedical and open-domain datasets. Results show that fine-tuned models consistently outperform prompting approaches, with gains of up to 20.5 F1 points, highlighting the importance of supervised adaptation for causal relation classification. The findings inform scalable causal graph verification, while noting the data annotation bottleneck and outlining future work to extend the analysis to multi-variable causal graphs.

Abstract

This study explores the capability of Large Language Models (LLMs) to evaluate causality in causal graphs generated by conventional statistical causal discovery methods, a task traditionally reliant on manual assessment by human subject-matter experts. To bridge this gap in causality assessment, LLMs are employed to evaluate causal relationships by determining whether a causal connection between a variable pair can be inferred from textual context. Our study compares two approaches: (1) a prompting-based method for zero-shot and few-shot causal inference, and (2) fine-tuning language models for the causal relation prediction task. While prompt-based LLMs have demonstrated versatility across various NLP tasks, our experiments on biomedical and general-domain datasets show that fine-tuned models consistently outperform them, achieving up to a 20.5-point improvement in F1 score, even when using smaller-parameter language models. These findings provide valuable insights into the strengths and limitations of both approaches for causal graph evaluation.
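
The prompting-based approach (1) amounts to one LLM query per edge of the discovered graph. The sketch below is a minimal zero-shot version, assuming the OpenAI chat API; the prompt wording, the model name, and the verify_causal_edge helper are illustrative assumptions, not the paper's actual setup.

    # Zero-shot causal edge verification via prompting (illustrative sketch).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = (
        "Context: {context}\n\n"
        "Question: Based only on the context above, does a change in "
        "'{cause}' cause a change in '{effect}'? Answer 'yes' or 'no'."
    )

    def verify_causal_edge(cause: str, effect: str, context: str) -> bool:
        """Return True if the model judges the edge cause -> effect as causal."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; the paper does not fix this choice
            messages=[{
                "role": "user",
                "content": PROMPT.format(context=context, cause=cause, effect=effect),
            }],
            temperature=0,
        )
        return response.choices[0].message.content.strip().lower().startswith("yes")

    # e.g. verify_causal_edge("smoking", "lung cancer", retrieved_abstract_text)

A few-shot variant would simply prepend a handful of labeled example pairs to the same prompt.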

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Illustration of a causal graph.
  • Figure 2: Fine-tuning BERT (see the sketch after this list).
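
The fine-tuning setup pictured in Figure 2 can be read as a standard sequence-classification fine-tune: a variable pair and its textual context are packed into one input sequence, and the model is trained on annotated causal/non-causal labels. Below is a minimal sketch, assuming the Hugging Face transformers API; the input packing, label convention, and learning rate are illustrative assumptions, not the paper's reported configuration.

    # Fine-tuning BERT for binary causal relation classification (illustrative).
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # causal vs. non-causal
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Variable pair plus its textual context, packed into a single sequence.
    text = ("smoking [SEP] lung cancer [SEP] "
            "Epidemiological studies link smoking to lung cancer.")
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    labels = torch.tensor([1])  # 1 = causal relation supported by the context

    # One supervised step; a real run loops over an annotated training set.
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()

At inference time, the trained classifier is applied to every edge proposed by the statistical causal discovery step, which is what makes graph validation scalable compared with manual expert review.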