AI Can Learn Scientific Taste

Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu

Abstract

Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show that Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test sets, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward human-level AI scientists.
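
The "field- and time-matched pairs of high- vs. low-citation papers" mentioned above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Paper` record, the bucketing by exact (field, year), and the `min_ratio` citation threshold are hypothetical and are not taken from the paper's actual data pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; the field names are illustrative only.
    title: str
    field: str
    year: int
    citations: int


def build_pairs(papers: List[Paper], min_ratio: float = 2.0) -> List[Tuple[Paper, Paper]]:
    """Return (high, low) pairs drawn from the same field and the same year.

    A pair is kept only when the first paper has at least `min_ratio` times
    the citations of the second (an assumed threshold, not the paper's).
    """
    # Group papers so that comparisons never cross fields or years.
    buckets: Dict[Tuple[str, int], List[Paper]] = defaultdict(list)
    for paper in papers:
        buckets[(paper.field, paper.year)].append(paper)

    pairs: List[Tuple[Paper, Paper]] = []
    for bucket in buckets.values():
        for high, low in product(bucket, repeat=2):
            # max(..., 1) avoids pairing everything against zero-citation papers
            # on a trivially satisfied threshold.
            if high.citations >= min_ratio * max(low.citations, 1):
                pairs.append((high, low))
    return pairs


a = Paper("A", "cs", 2020, 100)
b = Paper("B", "cs", 2020, 10)
c = Paper("C", "cs", 2021, 500)
d = Paper("D", "bio", 2020, 3)
pairs = build_pairs([a, b, c, d])
```

In this toy input only `a` and `b` share a field and a year, so they are the only papers that can be paired; `c` and `d` are never compared with them despite very different citation counts, which is the point of matching on field and time.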

Paper Structure (83 sections, 1 theorem, 14 equations, 4 figures, 13 tables)

Key Result

Proposition 1

Even if $I(p_a) = +\infty$ and $I(p_b) = +\infty$, the ordering $p_a \succ p_b$ is well-defined whenever the limit $\lim_{N \to \infty} \left[ I_N(p_a) - I_N(p_b) \right]$ exists in $\mathbb{R} \cup \{+\infty\}$.
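
As a brief illustration of the proposition, consider a hypothetical case. The growth rates below are assumptions chosen for the example, not values from the paper, and $I_N$ is read here as the impact truncated at horizon $N$, with $I(p) = \lim_{N \to \infty} I_N(p)$:

$$
I_N(p_a) = 2N, \quad I_N(p_b) = N
\;\Longrightarrow\;
I(p_a) = I(p_b) = +\infty,
\qquad
\lim_{N \to \infty} \left[ I_N(p_a) - I_N(p_b) \right] = \lim_{N \to \infty} N = +\infty .
$$

The limit of the difference exists in $\mathbb{R} \cup \{+\infty\}$, so the comparison is well-defined even though both impacts diverge; if Definition 1 orders papers by the sign of this limit, then $p_a \succ p_b$. By contrast, with $I_N(p_a) = N + (-1)^N$ and $I_N(p_b) = N$, the difference $(-1)^N$ oscillates between $-1$ and $+1$, the limit does not exist, and the condition of the proposition is not met.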

Figures (4)

  • Figure 1: (Left) Scientific Judge accuracy on SciJudgeBench; trained models (stars) outperform proprietary models. (Right) In-domain win rates of Scientific Thinker models against their untrained base policies under ensemble evaluation.
  • Figure 2: Overview of Reinforcement Learning from Community Feedback (RLCF). (1) Community feedback is collected as pairwise preference signals from naturally occurring community behavior. (2) A preference model is trained via GRPO to predict which item in a pair receives stronger community reception. (3) A policy model is trained via comparison-based GRPO: for each input, the policy samples a group of outputs, the preference model conducts pairwise comparisons to produce scalar rewards, and the policy is updated accordingly. In this work, we instantiate RLCF for scientific taste learning, where community feedback is derived from citation signals.
  • Figure 3: Scaling performance of Scientific Judge on SciJudgeBench (in-domain). Both SciJudge-Qwen3-4B and SciJudge-Qwen3-30B improve consistently across categories throughout training.
  • Figure 4: Scientific Thinker's performance under different base policy models and reward models. The top row uses SciJudge-Qwen3-4B as the reward model, while the bottom row uses the baseline reward model, Qwen3-4B-Instruct.
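
Step (3) of the RLCF overview in Figure 2 (sample a group of outputs, compare them pairwise, turn the comparisons into scalar rewards) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: using each output's win rate as its reward is one plausible aggregation and is not confirmed by this page, the `prefers` callable is a hypothetical stand-in for Scientific Judge, and the mean/standard-deviation normalization is the usual GRPO-style group advantage rather than the authors' exact formulation.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def pairwise_rewards(outputs: Sequence[str],
                     prefers: Callable[[str, str], bool]) -> List[float]:
    """Score each output by the fraction of pairwise comparisons it wins.

    `prefers(a, b)` is a hypothetical judge returning True when `a` is
    preferred over `b`.
    """
    n = len(outputs)
    if n < 2:
        raise ValueError("need at least two outputs to compare")
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        if prefers(outputs[i], outputs[j]):
            wins[i] += 1.0
        else:
            wins[j] += 1.0
    # Each output takes part in exactly n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within the sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy judge for demonstration only: prefers the longer string.
def toy_judge(a: str, b: str) -> bool:
    return len(a) > len(b)


group = ["a", "abc", "ab"]
rewards = pairwise_rewards(group, toy_judge)
advantages = group_advantages(rewards)
```

With the toy judge, the longest string wins both of its comparisons and the shortest wins none, so the rewards are ordered by length and the normalized advantages sum to approximately zero. In an actual training loop these advantages would weight the policy-gradient update for the corresponding sampled outputs.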

Theorems & Definitions (3)

  • Definition 1: Pairwise Impact Ordering
  • Proposition 1
  • proof