Interesting Scientific Idea Generation using Knowledge Graphs and LLMs: Evaluations with 100 Research Group Leaders

Xuemei Gu, Mario Krenn

TL;DR

SciMuse leverages a large knowledge graph built from millions of papers and GPT-4 to generate personalized research ideas for scientists. It conducts a large-scale human evaluation with over 100 experienced group leaders, producing 4,451 ratings for 2,996 ideas, enabling both supervised neural-network prediction and zero-shot LLM ranking of interest. The study shows that graph-derived features can predict interest with competitive accuracy and that advanced LLMs can rank ideas effectively even without human feedback, with performance improving in newer models. These findings suggest AI-assisted idea generation can augment interdisciplinary collaboration and potentially scale to automated experimentation in the future.

Abstract

The rapid growth of scientific literature makes it challenging for researchers to identify novel and impactful ideas, especially across disciplines. Modern artificial intelligence (AI) systems offer new approaches, potentially inspiring ideas not conceived by humans alone. But how compelling are these AI-generated ideas, and how can we improve their quality? Here, we introduce SciMuse, which uses 58 million research papers and a large language model to generate research ideas. We conduct a large-scale evaluation in which over 100 research group leaders -- from natural sciences to humanities -- ranked more than 4,400 personalized ideas based on their interest. This data allows us to predict research interest using (1) supervised neural networks trained on human evaluations, and (2) unsupervised zero-shot ranking with large language models. Our results demonstrate how future systems can help generate compelling research ideas and foster unforeseen interdisciplinary collaborations.
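
As a rough illustration of the zero-shot ranking route (2): the Figure 4 caption describes pairwise comparisons whose outcomes feed an Elo-based tournament, followed by a top-N precision check. The Python sketch below is one possible reading of that description, not the authors' implementation. The names `elo_rank`, `prefers_first`, and `precision_at_n`, the K-factor of 32, the starting rating of 1000, and the round-robin schedule are all assumptions made for illustration; `prefers_first` stands in for the LLM judge, which in practice would be an API call.

```python
import itertools
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_rank(items, prefers_first, k=32.0, rounds=3, seed=0):
    """Rank hashable items by Elo rating from repeated pairwise comparisons.

    prefers_first(a, b) -> bool is a hypothetical stand-in for the LLM judge.
    k, rounds and the initial rating of 1000 are illustrative choices.
    """
    rng = random.Random(seed)
    ratings = {item: 1000.0 for item in items}
    pairs = list(itertools.combinations(items, 2))
    for _ in range(rounds):
        rng.shuffle(pairs)  # vary the comparison order between rounds
        for a, b in pairs:
            score_a = 1.0 if prefers_first(a, b) else 0.0
            exp_a = expected_score(ratings[a], ratings[b])
            delta = k * (score_a - exp_a)
            ratings[a] += delta  # winner gains what the loser gives up
            ratings[b] -= delta
    return sorted(items, key=lambda item: ratings[item], reverse=True)


def precision_at_n(ranked, is_high_interest, n):
    """Fraction of the top-n ranked items that are labelled high-interest."""
    top = ranked[:n]
    return sum(1 for item in top if is_high_interest[item]) / len(top)


if __name__ == "__main__":
    # Toy data: hidden scores play the role of a researcher's true interest.
    hidden = {"idea-a": 5, "idea-b": 2, "idea-c": 4, "idea-d": 1}
    ranking = elo_rank(list(hidden), lambda a, b: hidden[a] > hidden[b])
    labels = {name: score >= 4 for name, score in hidden.items()}
    print(ranking, precision_at_n(ranking, labels, 2))
```

The high-interest labels here follow the caption's threshold (ratings of 4 or 5). A real judge would be noisy, which is why the sketch repeats the comparisons over several shuffled rounds.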

Paper Structure (13 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: SciMuse suggests research ideas or collaborations using a knowledge graph and GPT-4. (a) Knowledge graph generation. Nodes represent scientific concepts extracted from 2.44 million paper titles and abstracts using the RAKE algorithm [rose2010automatic], further refined with custom NLP techniques, manual review, GPT, and Wikipedia (to restore mistakenly removed concepts), resulting in a final list of 123,128 concepts. Edges are formed when two concepts co-occur in titles or abstracts of over 58 million papers from OpenAlex [openalex], augmented with citation data as a proxy for impact. A mini-knowledge graph illustrates the connections for two example papers [wakonig2019x, johnson2023ultrafast]. (b) AI-generated research collaborations. We extract concepts from the publications of Researchers A and B, refine them using GPT-4, and identify relevant sub-networks in the knowledge graph. GPT-4 then uses these concept pairs, along with the researchers' research information, to generate personalized research ideas or collaboration projects.
  • Figure 2: Large-scale human evaluation within the Max Planck Society. (a)-(b) The map of Germany, based on the GISCO statistical unit dataset from Eurostat [eurostatNUTS], shows the locations of the Max Planck Institutes and the participating group leaders. A total of 4,451 personalized AI-generated research suggestions were evaluated by 110 research group leaders. Each suggestion represents a potential collaboration between the evaluating researcher (Researcher A) and another researcher (Researcher B) from the Max Planck Society, visualized as bi-colored edges (orange for Researcher A, green for Researcher B). A purple circle indicates collaborations within the same institute, and edge transparency reflects the number of evaluated suggestions. Blue dots denote natural sciences (nat) and red dots represent social sciences (soc). (c) Distribution of interest ratings on a scale from 1 ('not interesting') to 5 ('very interesting'), with 394 suggestions rated 5 and 713 rated 4. Ratings are further categorized by whether the collaborations are within or across institutes, and by research field affiliation in either the natural or social sciences.
  • Figure 3: Analysis of interest levels versus knowledge graph features. We analyzed how eight features of the knowledge graph correlate with researchers' interest levels. After normalizing these features using z-scores, we arranged them from lowest to highest and divided the data into 50 equal groups. For each group, we plotted the average feature value (x-axis) against the average interest level (y-axis) with standard deviations, to identify trends. (a) and (b) show node features, (c)–(e) show node citation metrics, (f) shows an edge feature, (g) an edge citation metric, and (h) represents semantic distance between researchers' sub-networks (higher values indicate that the researchers' scientific fields are further apart). Data points include all 2,996 responses (blue), the top 50% of concept pairs by predicted impact (green), and the top 25% (red), using the neural-network-based impact prediction presented in [gu2024forecasting].
  • Figure 4: Predicting Scientific Interest. We use two distinct methods to predict interest levels: (1) a supervised neural network trained on human evaluations using only knowledge graph data (not the text of the actual suggestion), and (2) GPT in a zero-shot setting, ranking suggestions without any feedback from human evaluations. Both methods classify suggestions as highly interesting (ratings of 4 or 5) or not (below 4). The neural network uses 25 knowledge graph features and employs Monte Carlo cross-validation for accuracy estimation. For GPT, we conduct pairwise comparisons using personalized research details and rank suggestions through an Elo-based tournament system. (a) The ROC curve shows prediction accuracy of 64.5% for the neural network and 67.3% for GPT-4o. (b) The precision for top-N suggestions is significantly higher than random selection, with the top-1 precision reaching 70% for the neural network (51.0% for GPT-4o and 52.9% for GPT-3.5) and top-5 precision at 60.4% (46.7% for GPT-4o, 43.7% for GPT-3.5). (c) The probability of having at least one high-interest suggestion among the top N recommendations is significantly higher for the supervised neural network than for random selection. In practice, evaluation data from experienced researchers may not always be available, so it is encouraging that LLMs, even without human evaluation data, can rank suggestions effectively enough that the most interesting ones appear first.
  • Figure S1: Interest levels across different generation methods. Research ideas are generated using three methods: (1) no concepts provided by the knowledge graph, (2) random concepts from the researchers' subnetwork, and (3) predicted high-impact concept pairs from the researchers' subnetwork. The figure displays: (a) overall interest levels (numbers within bars show the number of responses for that evaluation), (b) interest levels for ideas generated without concepts from the knowledge graph, (c) interest levels with random concept pairs, and (d) interest levels using high-impact concept pairs (predicted by adapting the computational methods from [gu2024forecasting] and applying them to a different and much larger knowledge graph).
  • ...and 4 more figures
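
The binning procedure described in the Figure 3 caption (z-score normalization, sorting by feature value, splitting into equal groups, averaging per group) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `zscore` and `binned_trend` are hypothetical, the population standard deviation is used, the per-group spread is taken over the interest ratings, and groups are only approximately equal when the sample count is not divisible by the number of groups.

```python
import statistics


def zscore(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant feature: no spread to scale by
    return [(v - mean) / std for v in values]


def binned_trend(feature, interest, n_bins=50):
    """Return one (mean z-scored feature, mean interest, std of interest) per bin.

    Samples are sorted by z-scored feature value and cut into n_bins
    contiguous groups of (near-)equal size.
    """
    if len(feature) != len(interest):
        raise ValueError("feature and interest must have the same length")
    if not feature:
        return []
    pairs = sorted(zip(zscore(feature), interest), key=lambda p: p[0])
    n = len(pairs)
    n_bins = min(n_bins, n)  # guarantees every bin holds at least one sample
    trend = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        trend.append((statistics.fmean(xs), statistics.fmean(ys), statistics.pstdev(ys)))
    return trend


if __name__ == "__main__":
    # Toy data only: a synthetic feature and interest ratings on the 1-5 scale.
    feature = [float(i) for i in range(100)]
    interest = [1 + (i % 5) for i in range(100)]
    for x_mean, y_mean, y_std in binned_trend(feature, interest, n_bins=10):
        print(round(x_mean, 3), round(y_mean, 3), round(y_std, 3))
```

Plotting the first two entries of each tuple against each other, with the third as an error bar, gives the kind of trend curve the caption describes.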