Idea2Plan: Exploring AI-Powered Research Planning

Jin Huang, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen W. White

TL;DR

Idea2Plan formalizes the task of transforming a research idea into a structured research plan and introduces Idea2Plan Bench, built from ICML 2025 Spotlight/Oral papers published after major LLM training cutoffs to avoid data leakage. It pairs extracted ideas and rubrics with human expert assessments and proposes JudgeEval to evaluate LLM-based judges against expert annotations. The study compares multiple LLMs and agent configurations (direct prompting vs ReAct) across a rubric-based evaluation, finding GPT-5 and GPT-5-mini to perform best, while the ReAct agent does not outperform simpler baselines. The results illuminate both the potential and current limits of AI-driven research planning and lay groundwork for future improvements and evaluation standards.

Abstract

Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs' research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs' capability for research planning and lays the groundwork for future progress.

Paper Structure

This paper contains 47 sections, 5 figures, 31 tables.

Figures (5)

  • Figure 1: Idea2Plan overview: Starting from a research idea extracted from the abstract of an academic paper, we employ a pipeline that includes prompt-based and ReAct agent methods, integrated with search and paper reading tools, to generate research plans that are then evaluated against reference plans derived from the content of the corresponding paper by using a comprehensive grading rubric.
  • Figure 2: Average Planning Score for LLMs under four baselines (Naïve, 0-Shot, 1-Shot, ReAct). Scores are means over three independent runs per paper. We also include an upper bound in which o4-mini is given the original paper to generate a plan, approximating the highest achievable performance on Idea2Plan Bench.
  • Figure 3: Relationship between plan length and Average Planning Score under constrained and expanded generation settings. Left: When response length is controlled, GPT-5 and GPT-5-mini generate shorter plans but remain the top performers under stricter length constraints. Right: When longer outputs are allowed, GPT-4.1 and o4-mini produce more extensive plans with higher scores. Arrows indicate the shift from the original to the constrained or expanded setting.
  • Figure 4: Impact of curated literature on average Planning Score across LLMs. Bars show baseline performance (solid) and the improvement with curated literature (hatched), which enhances plan quality across all models. DS denotes DeepSeek.
  • Figure 5: Pairwise win rate matrices across prompting settings. Each cell $(i,j)$ shows the fraction of research ideas where model $i$ outperforms model $j$. The four panels correspond to (a) naïve prompting, (b) 0-shot, (c) 1-shot, and (d) ReAct settings. Across all settings, GPT-5 achieves the highest overall win rates.
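The aggregation described in the captions for Figures 2 and 5 (per-paper means over runs, then the fraction of ideas on which model $i$ outperforms model $j$) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the model and idea identifiers, and the toy scores are all hypothetical, and counting a tie as a win for neither model is an assumption the captions do not settle.

```python
from statistics import mean


def average_over_runs(run_scores):
    """Collapse per-run scores into one mean score per (model, idea).

    run_scores: {model: {idea_id: [score_run1, score_run2, ...]}}
    """
    return {
        model: {idea: mean(scores) for idea, scores in ideas.items()}
        for model, ideas in run_scores.items()
    }


def win_rate_matrix(scores):
    """Return {(i, j): fraction of shared ideas where model i scores above model j}.

    Assumption: ties count as a win for neither model.
    """
    models = sorted(scores)
    matrix = {}
    for i in models:
        for j in models:
            if i == j:
                continue
            shared = scores[i].keys() & scores[j].keys()
            wins = sum(1 for idea in shared if scores[i][idea] > scores[j][idea])
            matrix[(i, j)] = wins / len(shared) if shared else float("nan")
    return matrix


# Toy data (hypothetical): two models, three ideas, three runs each.
toy_runs = {
    "model_a": {"idea_1": [60, 70, 80], "idea_2": [50, 50, 50], "idea_3": [90, 90, 90]},
    "model_b": {"idea_1": [40, 50, 60], "idea_2": [50, 60, 70], "idea_3": [30, 30, 30]},
}
toy_means = average_over_runs(toy_runs)
toy_matrix = win_rate_matrix(toy_means)
```

With this toy data, `toy_matrix[("model_a", "model_b")]` is the share of the three ideas on which `model_a` has the higher mean score. Under the no-win-for-ties assumption, the two off-diagonal entries for a pair need not sum to 1.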