Accelerating Social Science Research via Agentic Hypothesization and Experimentation

Jishu Sen Gupta, Harini SI, Somesh Kumar Singh, Syed Mohamad Tawseeq, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah, Balaji Krishnamurthy

TL;DR

The paper addresses the bottleneck of end-to-end discovery in data-driven social science by introducing EXPERIGEN, a two-agent framework that unifies hypothesis generation with empirical validation over unstructured data. A Generator proposes plausible, novel hypotheses, while an Experimenter operationalizes features, runs statistical tests, and reports structured evidence, guided by a Bayesian-optimization–inspired two-phase search. Across 10 diverse tasks, EXPERIGEN discovers 2-4x more statistically significant hypotheses that are 7-17% more predictive and generalizes to multimodal and relational datasets; expert reviews find high novelty and rigor, and a real-world A/B test shows a 344% uplift in conversions for a deployed hypothesis. The work demonstrates robust improvements in predictive performance, statistical validity, and practical impact, highlighting a scalable path toward end-to-end discovery in social science and related domains.

Abstract

Data-driven social science research is inherently slow, relying on iterative cycles of observation, hypothesis generation, and experimental validation. While recent data-driven methods promise to accelerate parts of this process, they largely fail to support end-to-end scientific discovery. To address this gap, we introduce EXPERIGEN, an agentic framework that operationalizes end-to-end discovery through a Bayesian-optimization-inspired two-phase search, in which a Generator proposes candidate hypotheses and an Experimenter evaluates them empirically. Across multiple domains, EXPERIGEN consistently discovers 2-4x more statistically significant hypotheses that are 7-17% more predictive than prior approaches, and naturally extends to complex data regimes including multimodal and relational datasets. Beyond statistical performance, hypotheses must be novel, empirically grounded, and actionable to drive real scientific progress. To evaluate these qualities, we conduct an expert review of machine-generated hypotheses, collecting feedback from senior faculty. Among 25 reviewed hypotheses, 88% were rated moderately or strongly novel, 70% were deemed impactful and worth pursuing, and most demonstrated rigor comparable to senior graduate-level research. Finally, recognizing that ultimate validation requires real-world evidence, we conduct the first A/B test of LLM-generated hypotheses, observing statistically significant results with $p < 10^{-6}$ and a large effect size of 344%.

Paper Structure (112 sections, 3 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Inner Loop: Iterative Refinement Cycle. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}$ and a seed hypothesis $H_{i,1}$, the system refines through $T$ steps. At refinement step $j$: (1) The Generator proposes hypothesis $H_{i,j}$ conditioned on short-term memory $\mathcal{M}_{i,j}$; (2) The Experimenter operationalizes the hypothesis by specifying required features; (3) The Feature Annotator augments $\mathcal{D}$ with the operationalized feature $f_H$ (e.g., has_cite); (4) The Code Interpreter executes statistical tests on the augmented data; (5) Evidence $E_H(\mathcal{D}) = (p, \delta)$, comprising a p-value and an effect size, is returned. The memory $\mathcal{M}_{i,j} = \{(H_{i,k}, E_{H_{i,k}})\}_{k<j}$ accumulates hypothesis-evidence pairs from prior steps, enabling the Generator to propose refinements that address confounds or add contextual qualifiers. Hypotheses passing Bonferroni-corrected significance ($p < \alpha/T$) are candidates for the hypothesis bank $\mathcal{H}$. See Figure 2 for the outer loop that orchestrates multiple refinement cycles. The exact prompts for the Generator and Experimenter are given in the Agent Prompts subsection.
  • Figure 2: Outer Loop: Acquisition-Guided Exploration. At each outer iteration $i \in \{1,\ldots,N\}$, the Generator conditions on a dataset summary $\mathcal{D}$ and the current hypothesis bank $\mathcal{H}_{i-1}$ to induce a proposal distribution $q_i(H)=q(H \mid \mathcal{D}, \mathcal{H}_{i-1})$ (shown in blue). A seed hypothesis $H_{i,1}$ is sampled to implicitly maximize the acquisition objective $\mathcal{A}(H)= s_i(H)+\mathcal{N}(H,\mathcal{H}_{i-1})$, which balances quality and novelty. The seed enters the refinement loop (Figure 1), returning validated hypotheses that are added to $\mathcal{H}_i$. The resulting belief update is visualized as a distribution shift (shown in orange), reflecting how accepted hypotheses reshape the Generator’s implicit prior in subsequent iterations. For example, after observing hypotheses about concessions and framing as degree vs. binary, the Generator begins to surface a new latent feature, decision space, in later proposals.
  • Figure 3: Expert evaluation results across four dimensions (N=25 annotations from 5 domain experts). Experts rated 88% of hypotheses as novel, 76% as research-worthy, and perceived the experimental designs as reflecting senior graduate student level expertise.
  • Figure 4: Observation sampling dynamics across iterations. Accuracy (y-axis) vs. outer-loop iteration (x-axis) for four datasets. ExperiGen (random sampling) improves steadily throughout 30 iterations. Boosting saturates early (10-15 iterations) because residual errors increasingly overlap with existing hypotheses. Clustering performs comparably to random with higher variance. No Data underperforms throughout, confirming that concrete observations are necessary for hypothesis generation.
  • Figure 5: Inference scaling with hypothesis bank size. Accuracy (y-axis) vs. outer-loop iteration (x-axis). LLM-based inference plateaus after $\sim$20 iterations due to difficulty selecting relevant hypotheses from large candidate sets. An AutoML pipeline (gradient-boosted classifier) trained on Experimenter-extracted features continues improving up to 40 iterations. ExperiGen[AutoML] outperforms all other configurations after iteration 15 across all four datasets.
  • ...and 12 more figures
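
The control flow described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `refine`, `propose`, `evaluate`, `novelty`, and `acquisition` are hypothetical, the default `alpha = 0.05` is an assumed value (the captions give only the rule $p < \alpha/T$), the Jaccard-based novelty term is a stand-in for $\mathcal{N}(H, \mathcal{H}_{i-1})$, and the toy callables replace the LLM-backed Generator and Experimenter with canned outputs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    p_value: float      # p in E_H(D) = (p, delta)
    effect_size: float  # delta in E_H(D) = (p, delta)


# Short-term memory M_{i,j}: hypothesis-evidence pairs from earlier steps.
Memory = List[Tuple[str, Evidence]]


def refine(seed: str,
           propose: Callable[[str, Memory], str],
           evaluate: Callable[[str], Evidence],
           T: int,
           alpha: float = 0.05) -> Memory:
    """Run T refinement steps and return hypotheses with p < alpha / T."""
    threshold = alpha / T              # Bonferroni-corrected level
    memory: Memory = []
    candidates: Memory = []
    hypothesis = seed                  # H_{i,1}
    for j in range(1, T + 1):
        if j > 1:
            # Generator step: conditioned on the accumulated memory.
            hypothesis = propose(seed, memory)
        # Experimenter step: stands in for operationalization,
        # feature annotation, and statistical testing.
        evidence = evaluate(hypothesis)
        memory.append((hypothesis, evidence))
        if evidence.p_value < threshold:
            candidates.append((hypothesis, evidence))
    return candidates


def novelty(h: str, bank: List[str]) -> float:
    """Assumed novelty proxy: 1 - max Jaccard token overlap with the bank."""
    tokens = set(h.lower().split())
    if not bank or not tokens:
        return 1.0

    def jaccard(other: str) -> float:
        other_tokens = set(other.lower().split())
        return len(tokens & other_tokens) / len(tokens | other_tokens)

    return 1.0 - max(jaccard(b) for b in bank)


def acquisition(h: str, quality: float, bank: List[str]) -> float:
    """A(H) = s(H) + N(H, bank), with `quality` standing in for s_i(H)."""
    return quality + novelty(h, bank)


# Toy stand-ins (hypothetical): canned p-values instead of real tests.
def toy_propose(seed: str, memory: Memory) -> str:
    return f"{seed} (refinement {len(memory)})"


_toy_p_values = iter([0.04, 0.02, 0.001])


def toy_evaluate(hypothesis: str) -> Evidence:
    return Evidence(p_value=next(_toy_p_values), effect_size=0.3)


accepted = refine("posts with a citation are rated higher",
                  toy_propose, toy_evaluate, T=3)
```

With `T = 3` the corrected threshold is `0.05 / 3`, so only the third toy hypothesis (canned p-value 0.001) is kept as a candidate for the hypothesis bank. In the framework as described, the Generator maximizes the acquisition objective implicitly through its proposal distribution; the explicit `acquisition` function here only makes the quality-plus-novelty trade-off concrete.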