IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery

Ivaxi Sheth, Zhijing Jin, Bryan Wilder, Dominik Janzing, Mario Fritz

TL;DR

This work investigates whether large language models (LLMs) can aid instrumental variable (IV) discovery under endogeneity. It proposes IV Co-Scientist, a multi-agent framework that generates, critiques, and grounds candidate IVs for a treatment-outcome pair. The approach couples LLM-driven hypothesis generation with CriticAgent-based validation and a Grounder that maps IVs to observable proxies. Evaluation proceeds along two axes, recovery of canonical IVs and avoidance of invalid instruments, and adds a consistency metric for assessing internal validity in the absence of ground truth. Using Gapminder data, the study shows that certain LLMs can recover literature-based instruments with high semantic alignment and that the CriticAgents filter out invalid options, supporting the potential of LLMs as co-scientists in causal discovery. The results point to a practical way to augment human causal reasoning with automated, context-aware hypothesis generation, while noting limitations related to grounding, generalizability, and reliance on domain knowledge. Overall, the paper aims at principled, scalable early-stage IV discovery in high-dimensional observational data by integrating structured reasoning, statistical checks, and grounding steps.

Abstract

In the presence of confounding between an endogenous variable and the outcome, instrumental variables (IVs) are used to isolate the causal effect of the endogenous variable. Identifying valid instruments requires interdisciplinary knowledge, creativity, and contextual understanding, making it a non-trivial task. In this paper, we investigate whether large language models (LLMs) can aid in this task. We conduct a two-stage evaluation. First, we test whether LLMs can recover well-established instruments from the literature, assessing their ability to replicate standard reasoning. Second, we evaluate whether LLMs can identify and avoid instruments that have been empirically or theoretically discredited. Building on these results, we introduce IV Co-Scientist, a multi-agent system that proposes, critiques, and refines IVs for a given treatment-outcome pair. We also introduce a statistical test to contextualize consistency in the absence of ground truth. Our results show the potential of LLMs to discover valid instrumental variables from a large observational database.
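
To make the role of an instrument concrete, the following is a minimal sketch on simulated data, not the paper's pipeline or dataset. It compares a naive regression slope with the standard ratio (Wald) IV estimator when an unobserved confounder affects both treatment and outcome. The data-generating process, the coefficients, and the function names `simulate`, `ols_slope`, and `iv_slope` are all illustrative assumptions.

```python
import numpy as np


def simulate(n=20000, beta=2.0, seed=0):
    """Simulate a confounded treatment-outcome pair with a valid instrument.

    Assumed (illustrative) structure: z -> x -> y, with u affecting both x and y.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)  # unobserved confounder
    z = rng.normal(size=n)  # instrument: drives x, independent of u
    x = 1.0 * z + 1.0 * u + rng.normal(size=n)   # endogenous treatment
    y = beta * x + 2.0 * u + rng.normal(size=n)  # outcome; true effect is beta
    return z, x, y


def ols_slope(x, y):
    """Naive slope of y on x; biased when x is correlated with the confounder."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def iv_slope(z, x, y):
    """Ratio (Wald) IV estimator: cov(z, y) / cov(z, x)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]


z, x, y = simulate()
beta_ols = ols_slope(x, y)
beta_iv = iv_slope(z, x, y)
print(f"true effect: 2.0, OLS: {beta_ols:.3f}, IV: {beta_iv:.3f}")
```

In this setup the naive slope is pulled away from the true effect by the confounder, while the IV estimate stays close to it. That only holds because the simulated `z` satisfies relevance and exclusion by construction, which is exactly what cannot be taken for granted when instruments are proposed for real data.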

Paper Structure (55 sections, 17 equations, 2 figures, 7 tables)

Figures (2)

  • Figure 1: Overview of the IV Co-Scientist framework, which integrates LLM-based agents with traditional statistical tools.
  • Figure 2: Comparison of the ATE density under two different IVs: (a) LLM-proposed and (b) random, shown for Sanitation $\rightarrow$ Mortality with GPT-4o.