HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation

Rosni Vasu, Peter Jansen, Pao Siangliulue, Cristina Sarasua, Abraham Bernstein, Peter Clark, Bhavana Dalvi Mishra

TL;DR

HARPA addresses the problem of generating hypotheses for automated scientific discovery (ASD) that are both novel and executable, by grounding ideas in the literature and adapting to execution constraints. It introduces a two-component framework: a Proposal Generator that uses a three-stage process (trend identification, hypothesis-space exploration, convergence) to produce literature-grounded proposals, and a Scorer that estimates testability via an execution-informed reward model conditioned on the ASD agent. The key contributions include Socratic-question-driven refinement, a literature-informed hypothesis design space, an interpretable reward-distillation training regime, and an agent-conditioned scoring mechanism that improves feasibility, grounding, and execution success. Empirical results show that HARPA proposals are more executable and better grounded than those of the baseline, and that the reward-trained scorer achieves an absolute gain of roughly 28% over the untrained scorer, a step toward more capable AI-driven scientific discovery.

Abstract

While there has been a surge of interest in automated scientific discovery (ASD), especially with the emergence of LLMs, it remains challenging for tools to generate hypotheses that are both testable and grounded in the scientific literature. Additionally, existing ideation tools are not adaptive to prior experimental outcomes. We developed HARPA to address these challenges by incorporating an ideation workflow inspired by human researchers. HARPA first identifies emerging research trends through literature mining, then explores hypothesis design spaces, and finally converges on precise, testable hypotheses by pinpointing research gaps and justifying design choices. Our evaluations show that HARPA-generated hypothesis-driven research proposals perform comparably to a strong baseline, AI-Researcher, across most qualitative dimensions (e.g., specificity, novelty, overall quality), but achieve significant gains in feasibility (+0.78, $p<0.05$, bootstrap) and groundedness (+0.85, $p<0.01$, bootstrap) on a 10-point Likert scale. When tested with the ASD agent (CodeScientist), HARPA produced more successful executions (20 vs. 11 out of 40) and fewer failures (16 vs. 21 out of 40), showing that expert feasibility judgments track actual execution success. Furthermore, to simulate how researchers continuously refine, from experience, their understanding of which hypotheses are both testable and potentially interesting, HARPA learns a reward model that scores new hypotheses based on prior experimental outcomes, achieving an absolute gain of approximately 28% over HARPA's untrained baseline scorer. Together, these methods represent a step forward in the field of AI-driven scientific discovery.
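To make the execution counts above easier to compare, the following minimal Python sketch converts them into rates. The counts are the ones reported in the abstract; the `summarize` helper and the "other" bucket (whatever remains of the 40 runs after successes and failures, which the abstract does not describe) are illustrative assumptions, not part of HARPA.

```python
# Execution outcomes reported in the abstract (out of 40 proposals per system).
TOTAL = 40
counts = {
    "HARPA": {"success": 20, "failure": 16},
    "AI-Researcher": {"success": 11, "failure": 21},
}


def summarize(c, total=TOTAL):
    """Turn raw counts into rates; 'other' is the unexplained remainder."""
    other = total - c["success"] - c["failure"]
    return {
        "success_rate": c["success"] / total,
        "failure_rate": c["failure"] / total,
        "other": other,  # assumption: runs that are neither success nor failure
    }


summary = {name: summarize(c) for name, c in counts.items()}

# Absolute difference in success rate between the two systems.
success_gap = (
    summary["HARPA"]["success_rate"] - summary["AI-Researcher"]["success_rate"]
)

for name, s in summary.items():
    print(f"{name}: success={s['success_rate']:.1%}, "
          f"failure={s['failure_rate']:.1%}, other={s['other']}")
print(f"success-rate gap: {success_gap:.1%}")
```

Under these counts, HARPA's success rate is 50% against 27.5% for the baseline, a gap of 22.5 percentage points.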

Paper Structure

This paper contains 38 sections, 1 equation, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of HARPA. Starting from a source paper and a seed hypothesis derived from literature trends, HARPA constructs a world model of variables, values, and supporting evidence. The proposal generator consists of three stages (trend identification, hypothesis space exploration for divergence, and proposal sampling for convergence) that produce candidate hypothesis-driven research proposals. A dedicated scorer employs a reasoning-based reward model, informed by prior execution evidence, to evaluate testability with respect to the target ASD agent.
  • Figure 2: HARPA's Proposal Generator: divergence and convergence to literature-grounded, novel proposals.
  • Figure 3: HARPA Scorer. 1. Training Data Generation. HARPA generates candidate proposals $(P_{a}, P_{b})$, which are executed in the ASD-agent environment to produce raw execution traces $(E_{a}, E_{b})$. A teacher LLM analyzes these traces and outputs a high-fidelity, rubric-style reasoning trace with a justification and an answer, $\mathrm{Reason\_trace}(P_{a}, P_{b})$. 2. Reasoning Distillation and Reward Modeling. The student model is distilled from these reasoning traces, initialized as a policy, and fine-tuned via RLVR using preference labels to produce a rubric-style reasoning trace and a preference label (e.g., "Proposal A wins"; an example trace is given in Appendix L).
  • Figure 4: Mean difference between HARPA's proposal generator and AI-Researcher across nine evaluation dimensions, together with the differences in familiarity and confidence scores. Points show average differences; horizontal bars indicate 95% bootstrap confidence intervals (10,000 resamples). Stars indicate significant differences computed using the nonparametric bootstrap test (* $p<0.05$, ** $p<0.01$).
  • Figure 5: Execution results from CodeScientist. Left: outcome distribution across groups. Right: paired comparison of mean success rates showing HARPA significantly outperforms the baseline AI-Researcher.
  • ...and 2 more figures
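The bootstrap confidence intervals mentioned in the Figure 4 caption can be illustrated with a minimal Python sketch of a percentile bootstrap over paired score differences. This is an assumption-laden illustration, not the authors' analysis code: the function name `bootstrap_mean_diff`, the percentile method, the fixed seed, and the toy `diffs` values are all hypothetical, and the exact test procedure used in the paper is not specified in this section.

```python
import random


def bootstrap_mean_diff(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of paired differences.

    diffs: per-item score differences (e.g., system A minus system B).
    Returns (observed_mean, (ci_low, ci_high)).
    """
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    boot_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    # Read the CI endpoints off the sorted bootstrap distribution.
    low_idx = round((alpha / 2) * n_resamples)
    high_idx = round((1 - alpha / 2) * n_resamples) - 1
    observed = sum(diffs) / n
    return observed, (boot_means[low_idx], boot_means[high_idx])


# Toy paired differences on a Likert scale (illustrative only, not paper data).
diffs = [1, 0, 2, -1, 1, 1, 0, 2, 1, -1]
observed, (ci_low, ci_high) = bootstrap_mean_diff(diffs)
print(f"mean difference = {observed:.2f}, "
      f"95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

A common reading, assumed here rather than taken from the paper, is that a difference is flagged as significant at level `alpha` when the resulting interval excludes zero.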