Idea2Plan: Exploring AI-Powered Research Planning

Jin Huang; Silviu Cucerzan; Sujay Kumar Jauhar; Ryen W. White

TL;DR

Idea2Plan formalizes the task of transforming a research idea into a structured research plan and introduces Idea2Plan Bench, built from ICML 2025 Spotlight/Oral papers published after major LLM training cutoffs to avoid data leakage. It pairs extracted ideas and rubrics with human expert assessments and proposes JudgeEval to evaluate LLM-based judges against expert annotations. The study compares multiple LLMs and agent configurations (direct prompting vs. ReAct) under a rubric-based evaluation: GPT-5 and GPT-5-mini perform best, while the ReAct agent does not outperform simpler baselines. The results illuminate both the potential and the current limits of AI-driven research planning and lay the groundwork for future improvements and evaluation standards.

Abstract

Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs' research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs' capability for research planning and lays the groundwork for future progress.
