HindSight: Evaluating LLM-Generated Research Ideas via Future Impact

Bo Jiang

Abstract

Evaluating AI-generated research ideas typically relies on LLM judges or human panels -- both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff~$T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p{=}0.584$), while HindSight shows the retrieval-augmented system produces 2.5$\times$ higher-scoring ideas ($p{<}0.001$). Moreover, HindSight scores are \emph{negatively} correlated with LLM-judged novelty ($\rho{=}{-}0.29$, $p{<}0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
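The time-split procedure described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `Paper` record, the precomputed `impact` field, the brute-force cosine matching (standing in for the SPECTER2 and FAISS pipeline named in Figure 1), and the choice to score an idea by its highest-impact match are all assumptions made for illustration. The 30-month window comes from the abstract, and the default threshold of 0.96 is the value marked in Figure 3.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Paper:
    """Hypothetical record for one paper in the ground-truth pool."""
    title: str
    published: date
    embedding: list[float]
    impact: float  # assumed precomputed from citations and venue


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months; the day is clamped to 28 for simplicity."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, min(d.day, 28))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hindsight_score(idea_embedding: list[float], papers: list[Paper],
                    cutoff: date, window_months: int = 30,
                    theta: float = 0.96) -> float:
    """Score one idea against papers published after the cutoff.

    Assumption: an idea with no match at or above theta scores 0, and
    otherwise takes the impact of its highest-impact matched paper.
    """
    window_end = add_months(cutoff, window_months)
    # Keep only papers inside the post-cutoff evaluation window.
    future = [p for p in papers if cutoff < p.published <= window_end]
    matched = [p for p in future
               if cosine(idea_embedding, p.embedding) >= theta]
    return max((p.impact for p in matched), default=0.0)
```

Under these assumptions, an idea whose embedding matches no post-$T$ paper at the chosen threshold receives a score of zero, which is in line with the baseline clustering at zero reported in Figure 2.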

Paper Structure (39 sections, 3 equations, 5 figures, 3 tables)

Introduction
Related Work
Research Idea Generation.
Evaluation Methods.
Time-Split Evaluation.
The HindSight Framework
Problem Formulation
Time-Split Design
Matching
Impact Scoring
Experimental Setup
Ground Truth Pool
Idea Generation Systems
ResearchAgent (retrieval-augmented).
Vanilla baseline (no retrieval).
...and 24 more sections

Figures (5)

Figure 1: The HindSight framework. An idea generation system accesses only pre-$T$ literature to produce research ideas. These are encoded alongside post-$T$ papers using SPECTER2, matched via FAISS, and scored by the matched papers' real-world citation impact and venue prestige.
Figure 2: Score distributions for both evaluation methods. (a) HindSight clearly separates the two systems, with the baseline clustering at zero. (b) LLM-as-Judge Overall scores are nearly identical ($p{=}0.584$). Diamond markers show means.
Figure 3: Threshold sensitivity. (a) At lenient thresholds ($\theta{\leq}0.93$) nearly all ideas match, reducing discriminative power. (b) The ratio of RA to BL mean HindSight scores grows monotonically from 1.1$\times$ to 3.8$\times$ as $\theta$ increases, confirming that the advantage is robust and amplified at stricter thresholds. Dotted lines mark $\theta{=}0.96$.
Figure 4: Each idea plotted by LLM-Judge Overall ($x$) and HindSight score ($y$). Dashed lines mark the medians used for quadrant classification. Retrieval-augmented ideas (blue) concentrate in the upper quadrants; baseline ideas (orange) cluster at $y{=}0$.
Figure 5: Spearman $\rho$ between HindSight and LLM-Judge dimensions. Stars denote significance: * $p{<}0.05$, ** $p{<}0.01$, *** $p{<}0.001$.
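
The rank correlation reported in Figure 5 can be computed along the following lines. This is a generic sketch of Spearman's $\rho$ (Pearson correlation of average ranks, with ties sharing their mean rank), not the authors' analysis code; the function names are hypothetical, and the paired inputs would be per-idea HindSight scores and LLM-Judge scores.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Tie handling matters in this setting because many ideas share the same HindSight score (the baseline clusters at zero in Figure 2), so a ranking that broke ties arbitrarily would distort the correlation.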
