Learning to Predict Future-Aligned Research Proposals with Language Models

Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji

Abstract

Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after that cutoff. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 17,771 papers from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining a 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements from a novel model-merging method.
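To make the FAS evaluation loop concrete, the following is a minimal sketch of the retrieve-then-score pattern the abstract describes. All names here are hypothetical: the paper's actual pipeline uses dense retrieval and an LLM judge for the semantic-scoring step, whereas this toy stands in bag-of-words cosine similarity for both.

```python
from collections import Counter
from math import sqrt

def bow(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def future_alignment_score(proposal: str, future_corpus: list[str], k: int = 3) -> float:
    """Toy FAS: retrieve the top-k most similar post-cutoff papers,
    then average a semantic alignment score over them. Here cosine
    similarity plays both roles; the paper uses an LLM-based judge
    for the second step."""
    sims = sorted((cosine(bow(proposal), bow(doc)) for doc in future_corpus),
                  reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0
```

Under this framing, a higher score means the proposal's wording overlaps more with papers that appeared after the cutoff; the real metric replaces lexical overlap with learned semantic alignment.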

Paper Structure

This paper contains 67 sections, 15 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Given inspiring papers $S$ and a research question $q$ available before a cutoff time $t_C$, the model generates a proposal $\tilde{P}$. We evaluate whether the proposal anticipates future human research directions by comparing it against papers published after $t_C$ using retrieval and LLM-based semantic alignment.
  • Figure 2: Overview of the proposed future-aligned learning framework. Time-consistent supervision constructs training data from historical papers without future leakage, and citation-grounded stepwise reasoning decomposes proposal generation into staged scientific planning. Together, these enable LoRA-based supervised fine-tuning of a proposal generator, which is evaluated by Future Alignment Score (FAS) against a held-out future corpus and further validated through human evaluation and execution-based case studies.
  • Figure 3: Pairwise human evaluation results (win/tie/lose). Each stacked bar shows the fraction of instances where Stepwise CoT is preferred (win), the two proposals are judged equivalent (tie), or Stepwise CoT is not preferred (lose), aggregated by majority vote across three annotators.
  • Figure 4: Two proposals generated by Qwen2.5-14B-Instruct (stepwise CoT tuned). The content is summarized for readability. The proposals are textually sound, and implementing and executing them with code agents yields reasonable experimental results and findings.
  • Figure 5: System prompts used for different proposal-generation variants.
  • ...and 8 more figures