IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data
Tao Feng, Lizhen Qu, Niket Tandon, Gholamreza Haffari
TL;DR
IRIS addresses causal discovery when no tabular data exist: it automatically collects unstructured documents, extracts variable values from them, and applies a hybrid causal-discovery pipeline that combines statistical methods with LLM-based verification. A missing-variable proposal component expands the causal graph iteratively, and the framework relaxes the traditional acyclicity and causal-sufficiency assumptions. Run with GPT-4o, IRIS reports improved precision, recall, and F1 across multiple datasets, and it supports real-time exploration starting from a small set of initial variables. This makes scalable, verifiable causal discovery feasible in domains where pre-existing datasets are scarce or unavailable, which matters for rapid hypothesis generation and knowledge integration.
Abstract
Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.
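The loop described in the abstract (collect documents, extract variable values, run hybrid discovery, propose missing variables, repeat) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every function name, the toy "X causes Y" document format, and the choice to merge statistical and LLM-verified edges by set union are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)
Row = Dict[str, int]    # one extracted observation per document


def iterative_discovery(
    initial_vars: List[str],
    collect_docs: Callable[[List[str]], List[str]],
    extract_values: Callable[[List[str], List[str]], List[Row]],
    statistical_edges: Callable[[List[Row], List[str]], Set[Edge]],
    verified_edges: Callable[[List[str], List[str]], Set[Edge]],
    propose_missing: Callable[[List[str], List[str]], List[str]],
    max_rounds: int = 3,
) -> Tuple[List[str], Set[Edge]]:
    """Hypothetical outer loop: all five callables are stand-ins."""
    variables = list(initial_vars)
    edges: Set[Edge] = set()
    for _ in range(max_rounds):
        docs = collect_docs(variables)          # retrieve unstructured text
        rows = extract_values(docs, variables)  # build tabular-like data
        # Assumption: hybrid step modeled as a plain union of both edge sets.
        edges = statistical_edges(rows, variables) | verified_edges(docs, variables)
        new_vars = [v for v in propose_missing(docs, variables) if v not in variables]
        if not new_vars:
            break  # nothing left to add, graph is stable
        variables.extend(new_vars)
    return variables, edges


# Toy stand-ins so the sketch runs end to end (no retrieval, no LLM).
CORPUS = ["smoking causes cancer", "pollution causes cancer"]


def toy_collect(variables: List[str]) -> List[str]:
    return list(CORPUS)


def toy_extract(docs: List[str], variables: List[str]) -> List[Row]:
    # 1 if the variable is mentioned in the document, else 0.
    return [{v: int(v in doc.split()) for v in variables} for doc in docs]


def toy_statistical(rows: List[Row], variables: List[str]) -> Set[Edge]:
    return set()  # placeholder for a statistical discovery algorithm


def toy_verified(docs: List[str], variables: List[str]) -> Set[Edge]:
    found: Set[Edge] = set()
    for doc in docs:
        cause, _, effect = doc.partition(" causes ")
        if cause in variables and effect in variables:
            found.add((cause, effect))
    return found


def toy_propose(docs: List[str], variables: List[str]) -> List[str]:
    # Propose any stated cause that is not yet a variable.
    return [doc.partition(" causes ")[0] for doc in docs]


final_vars, final_edges = iterative_discovery(
    ["smoking", "cancer"], toy_collect, toy_extract,
    toy_statistical, toy_verified, toy_propose,
)
print(final_vars, sorted(final_edges))
```

In this toy run the second round picks up the proposed variable and its edge, which mirrors the graph-expansion idea; a real system would replace each stand-in with retrieval, LLM-based extraction and verification, and a statistical discovery algorithm.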
