Accelerating Social Science Research via Agentic Hypothesization and Experimentation
Jishu Sen Gupta, Harini SI, Somesh Kumar Singh, Syed Mohamad Tawseeq, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah, Balaji Krishnamurthy
TL;DR
The paper addresses the bottleneck of end-to-end discovery in data-driven social science by introducing EXPERIGEN, a two-agent framework that unifies hypothesis generation with empirical validation over unstructured data. A Generator proposes plausible, novel hypotheses, while an Experimenter operationalizes features, runs statistical tests, and reports structured evidence, guided by a Bayesian-optimization-inspired two-phase search. Across 10 diverse tasks, EXPERIGEN discovers 2-4x more statistically significant hypotheses that are 7-17% more predictive than prior approaches, and it generalizes to multimodal and relational datasets. Expert reviews find high novelty and rigor, and a real-world A/B test shows a 344% uplift in conversions for a deployed hypothesis. Together, these results point to a scalable path toward end-to-end discovery in social science and related domains.
Abstract
Data-driven social science research is inherently slow, relying on iterative cycles of observation, hypothesis generation, and experimental validation. While recent data-driven methods promise to accelerate parts of this process, they largely fail to support end-to-end scientific discovery. To address this gap, we introduce EXPERIGEN, an agentic framework that operationalizes end-to-end discovery through a Bayesian-optimization-inspired two-phase search, in which a Generator proposes candidate hypotheses and an Experimenter evaluates them empirically. Across multiple domains, EXPERIGEN consistently discovers 2-4x more statistically significant hypotheses that are 7-17 percent more predictive than prior approaches, and it extends naturally to complex data regimes, including multimodal and relational datasets. Beyond statistical performance, hypotheses must be novel, empirically grounded, and actionable to drive real scientific progress. To evaluate these qualities, we conduct an expert review of machine-generated hypotheses, collecting feedback from senior faculty. Among 25 reviewed hypotheses, 88 percent were rated moderately or strongly novel, 70 percent were deemed impactful and worth pursuing, and most demonstrated rigor comparable to senior graduate-level research. Finally, recognizing that ultimate validation requires real-world evidence, we conduct the first A/B test of LLM-generated hypotheses, observing statistically significant results (p < 1e-6) and a large effect size of 344 percent.
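For readers unfamiliar with how an A/B result of this kind is typically summarized, the sketch below computes a relative uplift and a two-sided p-value with a pooled two-proportion z-test. This is an illustrative assumption, not the test used in the paper: the abstract does not specify the statistical procedure, and the function name and the visitor and conversion counts are hypothetical placeholders rather than the study's data.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative_uplift, z, two_sided_p) for control A vs. treatment B.

    Assumes a pooled two-proportion z-test; the paper may use a different test.
    """
    p_a = conv_a / n_a                      # control conversion rate
    p_b = conv_b / n_b                      # treatment conversion rate
    uplift = (p_b - p_a) / p_a              # relative uplift (1.0 means +100%)
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return uplift, z, p_value

# Hypothetical counts chosen only to exercise the function.
uplift, z, p = two_proportion_z_test(conv_a=50, n_a=10_000, conv_b=100, n_b=10_000)
print(f"relative uplift = {uplift:.0%}, z = {z:.2f}, p = {p:.2e}")
```

Under this convention, an uplift of 344 percent would mean the treatment's conversion rate is 4.44 times the control's.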
