ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Yujie Liu; Zonglin Yang; Tong Xie; Jinjie Ni; Ben Gao; Yuqiang Li; Shixiang Tang; Wanli Ouyang; Erik Cambria; Dongzhan Zhou

TL;DR

ResearchBench is presented as the first large-scale, cross-disciplinary benchmark for evaluating LLMs on three sub-tasks of scientific discovery (inspiration retrieval, hypothesis composition, and hypothesis ranking), using papers published in 2024 to minimize data contamination. An automated agentic framework extracts the core components of each paper (research questions, background surveys, inspirations, and hypotheses), and domain experts validate the extraction. Experiments across 12 disciplines show strong performance in inspiration retrieval, with more modest results in composition and ranking, and identify inspiration retrieval as the main bottleneck for automated discovery. The work demonstrates the potential of LLMs as autonomous sources of novel hypotheses and outlines key challenges and directions on the way to fully automated scientific exploration.

Abstract

Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
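To make the decomposition concrete, the following is a minimal Python sketch of how one benchmark entry and an inspiration-retrieval score might be represented. The class name `BenchmarkEntry`, its field names, and the `retrieval_hit_ratio` function are illustrative assumptions based only on the components listed in the abstract; they are not the paper's actual schema or scoring code.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkEntry:
    """One paper decomposed into the components named in the abstract.

    Hypothetical layout: the field names are assumptions, not the
    benchmark's real schema.
    """
    research_question: str
    background_survey: str
    inspirations: list[str]  # titles of the ground-truth inspiration papers
    hypothesis: str


def retrieval_hit_ratio(entry: BenchmarkEntry, selected: list[str]) -> float:
    """Fraction of ground-truth inspirations that appear in `selected`.

    Titles are compared case-insensitively after trimming whitespace.
    This is an assumed, simplified metric for illustration.
    """
    if not entry.inspirations:
        return 0.0
    chosen = {title.strip().lower() for title in selected}
    hits = sum(1 for title in entry.inspirations if title.strip().lower() in chosen)
    return hits / len(entry.inspirations)


# Placeholder entry; the strings are dummy values, not benchmark data.
entry = BenchmarkEntry(
    research_question="(placeholder question)",
    background_survey="(placeholder survey)",
    inspirations=["Paper A", "Paper B"],
    hypothesis="(placeholder hypothesis)",
)

# A model picks three candidates; one of the two true inspirations is among them.
ratio = retrieval_hit_ratio(entry, ["paper a", "Paper C", "Paper D"])
print(ratio)  # 0.5
```

In this sketch, hypothesis composition and hypothesis ranking would consume the same entry: composition would take the research question, background survey, and inspirations as input, and ranking would compare the ground-truth hypothesis against alternatives.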
