Automated Hypothesis Validation with Agentic Sequential Falsifications
Kexin Huang, Ying Jin, Ryan Li, Michael Y. Li, Emmanuel Candès, Jure Leskovec
TL;DR
This paper introduces Popper, an agentic framework for automated validation of free-form hypotheses, inspired by Karl Popper's falsification principle. Popper uses two specialized LLM agents to iteratively design and execute falsification experiments. It converts the resulting p-values to e-values and aggregates the evidence with anytime validity, so the Type-I error is strictly controlled. The framework is instantiated across six domains, where it demonstrates robust error control, improved power, and scalability. In an expert human study, it performs comparably to human scientists while taking substantially less time. Empirical results highlight the importance of principled error control for LLM-driven hypothesis validation and position Popper as a scalable tool for accelerating scientific discovery and cross-domain inference. The work also discusses limitations and avenues for extending the framework to broader error metrics and more complex experimental settings.
Abstract
Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieves comparable performance in validating complex biological hypotheses while reducing the time required tenfold, providing a scalable, rigorous solution for hypothesis validation.
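The sequential aggregation of e-values mentioned in the TL;DR can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a standard p-to-e calibrator, e = kappa * p^(kappa - 1) with kappa in (0, 1), and a stopping rule that rejects the null once the running product of e-values reaches 1/alpha. The function names, the default `kappa=0.5`, and the default `alpha=0.1` are hypothetical choices made for illustration. The product rule is valid only when each p-value is valid conditional on the experiments run before it.

```python
from typing import Iterable, Tuple


def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Calibrate a p-value into an e-value via e = kappa * p**(kappa - 1).

    Assumed calibrator for this sketch; kappa must lie strictly in (0, 1).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must be in (0, 1)")
    return kappa * p ** (kappa - 1.0)


def sequential_falsification(
    p_values: Iterable[float], alpha: float = 0.1, kappa: float = 0.5
) -> Tuple[bool, float, int]:
    """Multiply e-values one experiment at a time and stop at 1/alpha.

    Returns (rejected, accumulated_evidence, experiments_used).
    """
    evidence = 1.0  # running product of e-values
    used = 0
    for p in p_values:
        evidence *= p_to_e(p, kappa)
        used += 1
        # By Ville's inequality, crossing 1/alpha under the null
        # happens with probability at most alpha, at any stopping time.
        if evidence >= 1.0 / alpha:
            return True, evidence, used
    return False, evidence, used
```

With these defaults, two experiments that each yield p = 0.01 give e-values of about 5 each, so their product of about 25 crosses the threshold of 10 after the second experiment. A stream of uninformative p-values such as 0.5 yields e-values below 1, and the accumulated evidence shrinks instead of growing.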
