IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data

Tao Feng, Lizhen Qu, Niket Tandon, Gholamreza Haffari

TL;DR

IRIS tackles the absence of tabular data in causal discovery by automatically collecting unstructured data, extracting variable values, and applying a hybrid causal-discovery pipeline that combines statistical methods with LLM-based verification. It introduces a missing-variable proposal to expand causal graphs iteratively, relaxes traditional acyclicity and causal-sufficiency assumptions, and demonstrates superior performance across multiple datasets with GPT-4o. The framework achieves robust improvements in precision, recall, and F1 while enabling real-time exploration from a small initial variable set. These capabilities enable scalable, verifiable causal discovery in domains where pre-existing datasets are scarce or unavailable, with important implications for rapid hypothesis generation and knowledge integration.

Abstract

Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.
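The iterative expansion described in the abstract can be sketched as a loop that alternates between causal discovery and missing-variable proposal. This is a minimal illustration, not the authors' implementation: `expand_causal_graph`, `discover`, `propose`, and `max_rounds` are hypothetical names, and the toy `discover` and `propose` functions stand in for IRIS's document retrieval, value extraction, hybrid causal discovery, and LLM-based proposal steps, none of which are reproduced here.

```python
from typing import Callable, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def expand_causal_graph(
    initial_variables: Iterable[str],
    discover: Callable[[Set[str]], Set[Edge]],
    propose: Callable[[Set[str], Set[Edge]], Set[str]],
    max_rounds: int = 3,  # hypothetical stopping budget, not from the paper
) -> Tuple[Set[str], Set[Edge]]:
    """Alternate discovery and variable proposal until nothing new appears."""
    variables = set(initial_variables)
    edges = discover(variables)
    for _ in range(max_rounds):
        new_variables = propose(variables, edges) - variables
        if not new_variables:
            break  # no missing variables proposed, so the graph is stable
        variables |= new_variables
        edges = discover(variables)  # re-run discovery on the larger set
    return variables, edges


# Toy stand-ins (assumptions): a fixed reference graph plays the role of
# both the retrieved evidence and the LLM's background knowledge.
REFERENCE_EDGES: Set[Edge] = {
    ("smoking", "lung cancer"),
    ("pollution", "lung cancer"),
    ("lung cancer", "fatigue"),
}


def toy_discover(variables: Set[str]) -> Set[Edge]:
    # Keep only edges whose endpoints are both currently known.
    return {(a, b) for a, b in REFERENCE_EDGES if a in variables and b in variables}


def toy_propose(variables: Set[str], edges: Set[Edge]) -> Set[str]:
    # Propose every variable adjacent to a known one in the reference graph.
    return {v for a, b in REFERENCE_EDGES if a in variables or b in variables for v in (a, b)}


final_variables, final_edges = expand_causal_graph(
    {"smoking", "lung cancer"}, toy_discover, toy_propose
)
```

In this toy run the loop starts from two variables, adds the two neighbouring ones in the first round, and stops in the second round because no further variables are proposed.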

Paper Structure

This paper contains 25 sections, 1 equation, 9 figures, 15 tables, 3 algorithms.

Figures (9)

  • Figure 1: Illustration of IRIS. Given initial variables, we use the Google Search API and LLMs to collect relevant documents and extract variable values, then form structured data. For hybrid causal discovery, the statistical branch uses the structured data, while the causal relation extraction branch uses the retrieved documents. Their results are merged into the final causal graph. The missing variable proposal component identifies new variables, which are iteratively fed into our framework to expand the causal graphs.
  • Figure 2: Evaluation results of the causal discovery component on five datasets. A higher F1 score and a lower NHD ratio both indicate better performance. VCR refers to verified causal relations that are extracted from relevant academic documents and validated by LLMs. "Llama" refers to the use of the Llama-3.1-8b-instruct model as a substitute for GPT-4o in our method.
  • Figure 3: Illustration of expanded causal graphs for Cancer. Square nodes represent initial variables, while round nodes denote newly proposed variables.
  • Figure 4: Illustration of expanded causal graphs for Respiratory Disease. Square nodes represent initial variables, while round nodes denote newly proposed variables.
  • Figure 5: Illustration of expanded causal graphs for Diabetes. Square nodes represent initial variables, while round nodes denote newly proposed variables.
  • ...and 4 more figures
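
The merge step in Figure 1 and the F1 metric in Figure 2 can be illustrated with a small sketch. The captions do not say how the two branches are merged, so the set union used here is an assumption, and `merge_graphs` and `edge_scores` are hypothetical helper names. The edge sets are made-up toy values, not results from the paper.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # directed edge (cause, effect)


def merge_graphs(statistical_edges: Set[Edge], verified_edges: Set[Edge]) -> Set[Edge]:
    """Combine the two branches; a plain union is assumed for illustration."""
    return set(statistical_edges) | set(verified_edges)


def edge_scores(predicted: Set[Edge], truth: Set[Edge]) -> Tuple[float, float, float]:
    """Edge-level precision, recall, and F1 for directed edges."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy inputs (assumptions): one edge set per branch and a reference graph.
statistical = {("a", "b"), ("b", "c")}
verified = {("b", "c"), ("c", "d")}
reference = {("a", "b"), ("b", "c"), ("d", "c")}

merged = merge_graphs(statistical, verified)
precision, recall, f1 = edge_scores(merged, reference)
```

Here the merged graph has three edges, two of which match the reference in direction, so precision, recall, and F1 are all 2/3. The reversed edge `("c", "d")` counts as an error because the comparison is on directed edges.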