Narrating Causal Graphs with Large Language Models
Atharva Phatak, Vijay K. Mago, Ameeta Agrawal, Aravind Inbasekaran, Philippe J. Giabbanelli
TL;DR
This work probes whether large language models can generate coherent natural-language descriptions from causal maps. It evaluates four GPT-3 variants on two causal-map datasets under fine-tuned, few-shot, and zero-shot regimes, with inputs that either include or omit explicit causal tags. Causal text quality improves with training data: few-shot performance comes close to fine-tuned performance, while zero-shot performance is substantially weaker. A key finding is that a small number of exemplars can match full fine-tuning in many cases, enabling faster deployment, and that causal tags generally help outside the zero-shot setting. The results, contrasted with a WebNLG baseline, suggest that GPT-3 can learn causality from limited examples but does not inherently encode it, underscoring the need for causality-focused evaluation metrics and for extensions to paragraph-level descriptions and intervention-focused reasoning.
Abstract
The use of generative AI to create text descriptions from graphs has mostly focused on knowledge graphs, which connect concepts using facts. In this work, we explore the capability of large pretrained language models to generate text from causal graphs, where salient concepts are represented as nodes and causality is represented via directed, typed edges. The causal reasoning encoded in these graphs can support applications as diverse as healthcare and marketing. Using two publicly available causal graph datasets, we empirically investigate the performance of four GPT-3 models under various settings. Our results indicate that while causal text descriptions improve with training data, they are harder to generate under zero-shot settings than descriptions of fact-based graphs. Results further suggest that users of generative AI can deploy future applications faster, since training a model with only a few examples yields performance similar to fine-tuning on a large curated dataset.
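
To make the setup concrete, the following is a minimal sketch of how a causal graph with directed, typed edges might be linearized into model input, with or without explicit causal tags, and wrapped into a few-shot or zero-shot prompt. Everything here is an assumption for illustration: the tag names (`<cause>`, `<relation>`, `<effect>`), the `+`/`-` edge types, the `Graph:`/`Text:` prompt layout, and the example edge are hypothetical and are not taken from the paper or its datasets.

```python
from typing import List, Tuple

# Hypothetical edge format: (cause, edge_type, effect), where edge_type is
# "+" (the cause increases the effect) or "-" (the cause decreases it).
Edge = Tuple[str, str, str]
Example = Tuple[List[Edge], str]  # (graph edges, reference description)


def linearize(edges: List[Edge], with_tags: bool = True) -> str:
    """Flatten a causal graph into one line of text.

    The tag vocabulary is an illustrative assumption, not the paper's format.
    """
    parts = []
    for cause, edge_type, effect in edges:
        relation = "increases" if edge_type == "+" else "decreases"
        if with_tags:
            parts.append(f"<cause> {cause} <relation> {relation} <effect> {effect}")
        else:
            parts.append(f"{cause} {relation} {effect}")
    return " | ".join(parts)


def build_prompt(examples: List[Example], query: List[Edge],
                 with_tags: bool = True) -> str:
    """Build a prompt; an empty `examples` list gives the zero-shot case."""
    blocks = [f"Graph: {linearize(edges, with_tags)}\nText: {text}"
              for edges, text in examples]
    blocks.append(f"Graph: {linearize(query, with_tags)}\nText:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up example edge and description, for illustration only.
    shots = [([("stress", "+", "overeating")], "Stress leads to more overeating.")]
    query = [("exercise", "-", "stress")]
    print(build_prompt(shots, query, with_tags=True))
```

Passing an empty list of examples produces the zero-shot variant of the same prompt, and toggling `with_tags` switches between the tagged and untagged input conditions.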
