
HindSight: Evaluating LLM-Generated Research Ideas via Future Impact

Bo Jiang

Abstract

Evaluating AI-generated research ideas typically relies on LLM judges or human panels, both of which are subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff $T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p{=}0.584$), while HindSight shows that the retrieval-augmented system produces 2.5$\times$ higher-scoring ideas ($p{<}0.001$). Moreover, HindSight scores are negatively correlated with LLM-judged novelty ($\rho{=}{-}0.29$, $p{<}0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
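
The time split described in the abstract can be illustrated with a minimal sketch. The record layout (`id`, `published`), the helper names, and the choice to treat a paper published exactly on $T$ as post-cutoff are assumptions made for illustration; the abstract only fixes the cutoff $T$ and the 30-month evaluation window.

```python
from datetime import date


def add_months(d, months):
    """Shift a date forward by whole months (day clamped to 28 to stay valid)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def time_split(papers, cutoff, horizon_months=30):
    """Split papers into pre-cutoff literature and the post-cutoff evaluation pool.

    Assumption: a paper published exactly on the cutoff counts as "future",
    and the window end is exclusive.
    """
    window_end = add_months(cutoff, horizon_months)
    visible = [p for p in papers if p["published"] < cutoff]
    future = [p for p in papers if cutoff <= p["published"] < window_end]
    return visible, future


# Hypothetical records, for illustration only.
papers = [
    {"id": "a", "published": date(2022, 6, 1)},
    {"id": "b", "published": date(2023, 1, 1)},
    {"id": "c", "published": date(2025, 6, 30)},
    {"id": "d", "published": date(2025, 7, 1)},
]
visible, future = time_split(papers, cutoff=date(2023, 1, 1))
```

Only `visible` would be exposed to the idea generation system; `future` is the pool that generated ideas are later matched against.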

Paper Structure (39 sections, 3 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: The HindSight framework. An idea generation system accesses only pre-$T$ literature to produce research ideas. These are encoded alongside post-$T$ papers using SPECTER2, matched via FAISS, and scored by the matched papers' real-world citation impact and venue prestige.
  • Figure 2: Score distributions for both evaluation methods. (a) HindSight clearly separates the two systems, with the baseline clustering at zero. (b) LLM-as-Judge Overall scores are nearly identical ($p{=}0.584$). Diamond markers show means.
  • Figure 3: Threshold sensitivity. (a) At lenient thresholds ($\theta{\leq}0.93$) nearly all ideas match, reducing discriminative power. (b) The ratio of RA to BL mean HindSight scores grows monotonically from 1.1$\times$ to 3.8$\times$ as $\theta$ increases, confirming that the advantage is robust and amplified at stricter thresholds. Dotted lines mark $\theta{=}0.96$.
  • Figure 4: Each idea plotted by LLM-Judge Overall ($x$) and HindSight score ($y$). Dashed lines mark the medians used for quadrant classification. Retrieval-augmented ideas (blue) concentrate in the upper quadrants; baseline ideas (orange) cluster at $y{=}0$.
  • Figure 5: Spearman $\rho$ between HindSight and LLM-Judge dimensions. Stars denote significance: * $p{<}0.05$, ** $p{<}0.01$, *** $p{<}0.001$.
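
The matching step summarized in the Figure 1 and Figure 3 captions can be sketched roughly as follows. This is not the authors' implementation: brute-force cosine similarity in NumPy stands in for the SPECTER2 + FAISS pipeline, `paper_impact` is a hypothetical precomputed per-paper score (the captions do not give the scoring formula), and taking only the single best match per idea is an assumption. The zero score for unmatched ideas follows the Figure 2 observation that baseline ideas cluster at zero.

```python
import numpy as np


def hindsight_scores(idea_vecs, paper_vecs, paper_impact, theta=0.96):
    """Score each idea by the impact of its nearest post-cutoff paper.

    An idea whose best cosine similarity is below `theta` gets a score of 0.
    `paper_impact` is a hypothetical precomputed impact value per paper.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    ideas = idea_vecs / np.linalg.norm(idea_vecs, axis=1, keepdims=True)
    papers = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)

    sims = ideas @ papers.T                      # shape: (n_ideas, n_papers)
    best = sims.argmax(axis=1)                   # index of the closest paper
    best_sim = sims[np.arange(len(ideas)), best]

    return np.where(best_sim >= theta, paper_impact[best], 0.0)


# Toy 2-D embeddings, for illustration only.
idea_vecs = np.array([[1.0, 0.0], [0.6, 0.8]])
paper_vecs = np.array([[2.0, 0.0], [0.0, 1.0]])
paper_impact = np.array([0.7, 0.3])
scores = hindsight_scores(idea_vecs, paper_vecs, paper_impact)
```

In this toy case the first idea coincides with the first paper and inherits its impact, while the second idea's best similarity (0.8) falls below $\theta{=}0.96$ and scores zero. Lowering `theta` lets more ideas match, which is consistent with the reduced discriminative power at lenient thresholds noted in Figure 3.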