InferenceEvolve: Towards Automated Causal Effect Estimators through Self-Evolving AI

Can Wang, Hongyu Zhao, Yiqun Chen

Abstract

Causal inference is central to scientific discovery, yet choosing appropriate methods remains challenging because of the complexity of both statistical methodology and real-world data. Inspired by the success of artificial intelligence in accelerating scientific discovery, we introduce InferenceEvolve, an evolutionary framework that uses large language models to discover and iteratively refine causal methods. Across widely used benchmarks, InferenceEvolve yields estimators that consistently outperform established baselines: against 58 human submissions in a recent community competition, our best evolved estimator lay on the Pareto frontier across two evaluation metrics. We also developed robust proxy objectives for settings without semi-synthetic outcomes, achieving competitive results. Analysis of the evolutionary trajectories shows that agents progressively discover sophisticated strategies tailored to unrevealed data-generating mechanisms. These findings suggest that language-model-guided evolution can optimize structured scientific programs such as causal inference, even when outcomes are only partially observed.

Paper Structure

This paper contains 58 sections, 14 equations, 10 figures, 13 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Overview of InferenceEvolve for causal estimator discovery. a, Schematic of the workflow: a causal task specification seeds a zero-shot program, which is then iteratively improved by an LLM ensemble under benchmark-based feedback, yielding an evolutionary trace and a final estimator. b, Principal-component view of the programs in OpenAI's text-embedding-3-large embedding space across evolutions. Colors denote benchmarks (gray points denote zero-shot baselines), and darker points indicate later checkpoints within each benchmark. The inset line graph (no points obscured) summarizes normalized best-so-far progress over evolution iterations. c, ACIC 2022 thal_causal_2023 comparison against zero-shot programs and 58 human competition submissions. Top, RMSE distributions. Bottom, empirical 90% interval coverage distributions. Across the full distribution, evolved programs improve over zero-shot generation and compare favorably with human submissions.
  • Figure 1: Code length distributions across datasets and program sources. Top panels show character count and bottom panels show non-empty lines of code for baseline, true-evolved, and proxy-evolved programs. Across all four benchmarks, evolved programs are consistently longer than the compact baseline programs, with the largest expansions occurring for ACIC 2022 true-evolved code and ACIC 2016 proxy-evolved code.
  • Figure 2: Held-out performance distributions for baselines, true-evolved programs, and proxy-evolved programs, with text- and code-complexity summaries. Panels a--d show the metric distributions on the held-out set for all baseline methods and the final true-evolved and proxy-evolved runs. Panels e and f summarize the length of GPT-5-generated methods paragraphs and program lines of code across datasets.
  • Figure 2: Lengths of manuscript-style methods paragraphs derived from code. Each program was translated into a single scientific-manuscript methods paragraph by GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5 using the same unconstrained instruction template. Across datasets and translator models, baseline programs yield the shortest descriptions, while true-evolved and proxy-evolved programs produce substantially longer methods paragraphs.
  • Figure 3: InferenceEvolve converges to dataset-specific estimator families without collapsing onto a single public template. a, Majority-vote-based algorithm-family assignments for the 96 final evolved programs; each cell reports the number of final programs from that dataset assigned to that family. b, Majority-vote-based most similar published method. c, Novel algorithmic components extracted from the union of LLM-as-a-judge reports for each program. d, Tokenized TF-IDF cosine similarity between each program and its closest official reference wrapper across zero-shot baselines and checkpoints, with boxplots stratified by dataset.
  • ...and 5 more figures
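The similarity analysis in Figure 3d compares each evolved program against its closest official reference wrapper via tokenized TF-IDF cosine similarity. A minimal sketch of that comparison is below; the identifier-based tokenizer and the scikit-learn pipeline are illustrative assumptions, not the authors' implementation.

```python
# Sketch: tokenized TF-IDF cosine similarity between a candidate program
# and a reference wrapper (as in Figure 3d). Tokenization scheme and
# scikit-learn usage are assumptions for illustration only.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def code_tokens(src: str) -> list[str]:
    # Split source code into identifier-like tokens (assumed tokenizer).
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", src)


def tfidf_cosine(program: str, reference: str) -> float:
    # Fit TF-IDF on the pair of documents and return their cosine similarity.
    vectorizer = TfidfVectorizer(analyzer=code_tokens)
    tfidf = vectorizer.fit_transform([program, reference])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])


# Hypothetical toy programs, only for demonstration.
prog = "def estimate_ate(df): return df.y[df.t == 1].mean() - df.y[df.t == 0].mean()"
ref = "def estimate_ate(data): return data.y[data.t == 1].mean() - data.y[data.t == 0].mean()"
print(tfidf_cosine(prog, ref))
```

In Figure 3d this score is computed per program against its closest official wrapper, then stratified by dataset; high values would indicate collapse onto a public template, which the caption reports does not occur.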