Automated Hypothesis Validation with Agentic Sequential Falsifications

Kexin Huang, Ying Jin, Ryan Li, Michael Y. Li, Emmanuel Candès, Jure Leskovec

TL;DR

This paper introduces Popper, an agentic framework for automated validation of free-form hypotheses, inspired by Karl Popper’s falsification principle. Popper uses two specialized LLM agents to iteratively design and execute falsification experiments, converting p-values to e-values and aggregating evidence with any-time validity to strictly control the Type-I error. The framework is instantiated across six domains, demonstrating robust error control, improved power, and scalability, with expert human studies showing comparable performance to human scientists but with substantial time savings. Empirical results highlight the importance of principled error control for LLM-driven hypothesis validation and showcase Popper as a scalable tool for accelerating scientific discovery and cross-domain inference. The work also discusses limitations and avenues for extending the framework to broader error metrics and more complex experimental settings.

Abstract

Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing time 10-fold, providing a scalable, rigorous solution for hypothesis validation.

Paper Structure

This paper contains 28 sections, 1 theorem, 2 equations, 11 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4

Define the aggregated evidence at the termination iteration as $E := \prod_{s=1}^{\tau} e_s$. Under Assumptions \ref{assump:hypo}, \ref{assump:evalue}, and \ref{assump:stopping}, $E$ is a valid e-value, i.e., $\mathbb{E}[E] \leq 1$ under $H_0$. In addition, define the validation status as $\hat{y} = \ldots$
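
As a brief aside on why a valid e-value yields error control: assuming the decision rule rejects $H_0$ when $E \geq 1/\alpha$ (the threshold described in the caption of Figure 1), Markov's inequality applied to the nonnegative variable $E$ gives the bound below. This is the standard argument for e-values and is offered as a sketch, not as a reproduction of the paper's proof.

$$
\mathbb{P}_{H_0}\!\left(E \geq \frac{1}{\alpha}\right) \;\leq\; \alpha\,\mathbb{E}_{H_0}[E] \;\leq\; \alpha .
$$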

Figures (11)

  • Figure 1: Illustration of Popper. Given a hypothesis and a pre-defined significance level $\alpha\in (0,1)$, Popper constructs sequential experiments to falsify the hypothesis. Each iteration proceeds as follows. First, an experiment design agent proposes a falsification experiment, which is refined through a self-critique process considering factors such as causality, data availability, and redundancy. The experiment is then evaluated by an LLM-as-a-judge relevance checker to ensure its alignment with the main hypothesis. If deemed relevant, the test is implemented by a ReAct-based experiment execution agent which obtains a p-value. P-values from multiple falsification experiments are aggregated into sequential e-values using a sequential testing framework. If the aggregated e-value exceeds $1/\alpha$, we declare sufficient evidence to reject the null hypothesis. Otherwise, the process continues with the next falsification test.
  • Figure 2: Expert human study. Popper achieved similar power and Type-I error rates to human experts while significantly reducing task completion time. It also generated more lines of code and conducted more statistical tests. Qualitatively, Popper and human experts exhibited substantial overlap in both the designed falsification experiments and the statistical methods employed.
  • Figure 3: Characterization of Popper. (1) Popper designs biologically relevant falsification experiments. (2) It performs multiple logical steps to execute the experiment. (3) It employs a wide range of statistical tests. (4) Progression of cumulative e-values across multiple iterations of falsification tests. More details are available in Appendix \ref{appendix:test_analysis}.
  • Figure 4: Sensitivity analysis. (1) Empirical Type-I error at various nominal levels $\alpha$. (2) Power and Type-I error at various budgets, as a function of the maximum number of tests.
  • Figure 5: Failure mode distribution for Popper, labeled automatically by O1 and manually checked by humans.
  • ...and 6 more figures
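
The aggregation step described in the caption of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `p_to_e` and `sequential_falsification` are hypothetical, the p-to-e calibrator $e = \kappa\, p^{\kappa-1}$ with $\kappa \in (0,1)$ is one standard choice assumed here, and the p-values are supplied directly rather than produced by LLM agents.

```python
from typing import Iterable, Tuple


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Convert a p-value to an e-value with the calibrator kappa * p**(kappa - 1).

    Assumption: this calibrator family is used for illustration; it integrates
    to 1 over (0, 1], so the result has expectation at most 1 under the null.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def sequential_falsification(
    p_values: Iterable[float], alpha: float = 0.1, kappa: float = 0.5
) -> Tuple[bool, float, int]:
    """Multiply e-values test by test and stop once the product reaches 1/alpha.

    Returns (rejected, aggregated_e_value, number_of_tests_used).
    """
    threshold = 1.0 / alpha
    aggregated = 1.0
    used = 0
    for p in p_values:
        used += 1
        aggregated *= p_to_e(p, kappa)
        if aggregated >= threshold:
            # Enough evidence against the null: stop early.
            return True, aggregated, used
    # Test budget exhausted without reaching the threshold.
    return False, aggregated, used


if __name__ == "__main__":
    print(sequential_falsification([0.01, 0.01], alpha=0.1))
    print(sequential_falsification([0.5, 0.5, 0.5], alpha=0.1))
```

With $\kappa = 0.5$ and $\alpha = 0.1$, a p-value of 0.01 maps to an e-value of 5, so two such tests give a product of 25 and cross the threshold of 10 at the second test. Unremarkable p-values such as 0.5 map to e-values below 1, so the product shrinks and the hypothesis is not validated.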

Theorems & Definitions (2)

  • Theorem 4
  • Proof of Theorem \ref{thm:valid}