
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz

TL;DR

InnoEval reframes idea evaluation as a knowledge-grounded, multi-perspective process that mirrors human peer review. It combines a heterogeneous knowledge search, a diversified reviewer board, and multi-criteria evaluation to produce actionable, evidence-backed evaluations and revision suggestions. Datasets of peer-reviewed ideas enable point-wise, pair-wise, and group-wise benchmarking, with InnoEval consistently outperforming baselines and showing strong human-aligned judgments. The work underscores the importance of living knowledge, consensus-building, and transparent feedback for scalable, high-quality scientific idea assessment.

Abstract

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

Paper Structure (32 sections, 12 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Framework of InnoEval. Structured idea parts with dashed boxes are optional. Given the raw-text idea, the deep knowledge search engine (left-hand) iterates $\boldsymbol{N}$ times, ultimately yielding enriched knowledge reports categorized into three types: literature, web, and code. During evaluation, each metric is assessed by a dedicated evaluator agent, and users can freely register additional metrics.
  • Figure 2: Left Heat Map: Human Evaluation. Correlations between InnoEval's scores on the five dimensions and the scores assigned by human experts (Human), as well as by online peer-review comments (Reviews) on the same five dimensions. Right Bar Charts: Ablation Studies. -Grounding removes the grounding module and feeds raw search results directly into evaluation; -Personalized disables the persona, letting the agent evaluate without assumed identity; -Web&Code restricts retrieval to relevant papers only (no web or code search); o4-mini swaps the backbone model for o4-mini; InnoEval denotes our full configuration.
  • Figure 3: (a) Multi-perspective Test-time Scaling. We compare the test-time scaling results with and without academic personas on point-wise and group-wise tasks. (b) Search Module Eval. We compare our heterogeneous deep knowledge search engine with the search modules of other baselines on four metrics. The specific definition of each metric can be found in Appx. \ref{app:search_metrics}. (c) Idea Generation. We use the evaluation results of different methods as feedback to improve the idea generation pipeline in ResearchAgent. (d) Metrics Influence. We use linear regression to explore which metrics are critical in determining whether an idea is accepted or highlighted.
  • Figure 4: Scatter Plots Between Metric Pairs. We perform linear regression fitting for each metric pair. The red dashed line is the fit after removing outliers, and its slope is reported. $r$ denotes the Pearson coefficient, $\rho$ is the Spearman coefficient, and $R^2$ represents the goodness of fit of the fitted line over all inliers. A complete version of all metric pairs' correlations can be seen in Fig. \ref{fig:all_metrics_pair}.
  • Figure 5: Scatter Plots Between Metric Pairs. We perform linear regression fitting for each metric pair. The red dashed line is the fit after removing outliers, and its slope is reported. $r$ denotes the Pearson coefficient, $\rho$ is the Spearman coefficient, and $R^2$ represents the goodness of fit of the fitted line over all inliers.
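
The statistics named in the metric-pair captions (Pearson $r$, Spearman $\rho$, and a line refit after outlier removal with its slope and $R^2$) can be illustrated with a minimal sketch. The captions do not say how outliers are identified, so the residual threshold `k` below is an assumption, as are the function names and the choice to compute $r$ and $\rho$ on all points rather than on inliers only. This is not the authors' implementation.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    # Group equal values and replace each rank by the group mean.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def pearson(x, y):
    """Pearson correlation coefficient r."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman coefficient rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fit_without_outliers(x, y, k=2.0):
    """Fit a line, drop points with large residuals, then refit.

    ASSUMPTION: a point is an outlier if its residual from the first fit
    exceeds k standard deviations of all residuals. The source does not
    specify the outlier rule.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # First pass: ordinary least squares on all points.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    inliers = np.abs(residuals) <= k * residuals.std() + 1e-12

    # Second pass: refit on inliers only and score the fit on them.
    slope, intercept = np.polyfit(x[inliers], y[inliers], 1)
    fitted = slope * x[inliers] + intercept
    ss_res = np.sum((y[inliers] - fitted) ** 2)
    ss_tot = np.sum((y[inliers] - y[inliers].mean()) ** 2)
    return {
        "slope": float(slope),
        "intercept": float(intercept),
        "r2": float(1.0 - ss_res / ss_tot),
        "n_inliers": int(inliers.sum()),
        "r": pearson(x, y),       # computed on all points (assumption)
        "rho": spearman(x, y),    # computed on all points (assumption)
    }
```

Calling `fit_without_outliers(scores_a, scores_b)` on two arrays of per-idea scores (hypothetical names) returns the refit slope and $R^2$ alongside both correlation coefficients. A robust alternative such as a median-based residual threshold would also fit the captions' description.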