AI Idea Bench 2025: AI Research Idea Generation Benchmark
Yansheng Qiu, Haoquan Zhang, Zhaopan Xu, Ming Li, Diping Song, Zheng Wang, Kaipeng Zhang
TL;DR
AI Idea Bench 2025 introduces a benchmark for quantitatively evaluating AI research idea generation. It pairs 3,495 post-cutoff AI papers with their inspiration sources and provides an open-ended, reference-based evaluation framework. It defines six evaluation tasks (IMCQ, I2I, I2T, cross-benchmark competition, novelty, and feasibility) and reports how multiple baselines perform under both target-paper alignment and external-reference analyses. The framework emphasizes ground-truth alignment to mitigate data leakage and uses objective similarity and feasibility measures to rank idea-generation methods. The work aims to catalyze robust, scalable evaluation of automated scientific discovery and gives the community a concrete dataset and methodology for benchmarking and comparing future idea-generation approaches.
Abstract
Large language models (LLMs) have revolutionized human-AI interaction and achieved significant success in generating novel ideas. However, current assessments of idea generation overlook crucial factors such as knowledge leakage in LLMs, the absence of open-ended benchmarks with ground truth, and the limited scope of feasibility analysis, which is constrained by prompt design. These limitations hinder the discovery of groundbreaking research ideas. In this paper, we present AI Idea Bench 2025, a framework designed to quantitatively evaluate and compare the ideas generated by LLMs within the domain of AI research from diverse perspectives. The framework comprises a comprehensive dataset of 3,495 AI papers and the works that inspired them, along with a robust evaluation methodology. This evaluation system gauges idea quality along two dimensions: alignment with the ground-truth content of the original papers and judgment based on general reference material. AI Idea Bench 2025's benchmarking system stands to be an invaluable resource for assessing and comparing idea-generation techniques, thereby facilitating the automation of scientific discovery.
