LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context

Kai Ruan, Xuan Wang, Jixiang Hong, Peng Wang, Yang Liu, Hao Sun

TL;DR

LiveIdeaBench introduces a minimal-context benchmark to evaluate LLMs' divergent thinking for scientific ideation across 22 domains. By using single-keyword prompts and a dynamic panel of judge models, it scores ideas on originality, feasibility, fluency, flexibility, and clarity, revealing a weak link between general intelligence and scientific ideation. The results show domain-specific strengths and notable trade-offs, with some smaller models rivaling larger ones, underscoring the need for specialized benchmarks and training strategies for scientific idea generation. The work also discusses environmental costs, evaluation challenges, and directions for human-AI collaborative discovery in science.

Abstract

While Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design (e.g., accurately extracting key findings from papers or generating coherent experimental procedures), existing evaluation benchmarks primarily assess performance using rich contextual inputs. We introduce LiveIdeaBench, a comprehensive benchmark evaluating LLMs' scientific idea generation by assessing divergent thinking capabilities using single-keyword prompts. Drawing from Guilford's creativity theory, our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Through extensive experimentation with over 40 leading models across 1,180 keywords spanning 22 scientific domains, we reveal that the scientific idea generation capabilities measured by our benchmark are poorly predicted by standard metrics of general intelligence. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores. These findings highlight the need for specialized evaluation benchmarks for scientific idea generation and suggest that enhancing these idea generation capabilities in LLMs may require different training strategies than those used for improving general problem-solving abilities, potentially enabling a wider range of AI tools tailored for different stages of the scientific process.

Paper Structure

This paper contains 20 sections, 25 figures, and 5 tables.

Figures (25)

  • Figure 1: Overall design of the LiveIdeaBench benchmark. a. Over 1,000 scientific keywords, representing diverse domains, are used in prompts for the Idea LLMs, encouraging divergent thinking and the generation of novel scientific ideas. b. Sampled Judge LLMs evaluate the generated ideas across three primary dimensions: originality, feasibility, and clarity, assigning numerical scores to each idea. c. The evaluation panel comprises the top 10 state-of-the-art models selected from LiveBench, ensuring robust assessment through sampling and ensemble scoring. d. Fluency scores are derived by analyzing the diversity and substantive differences among ideas generated from the same keyword (using a randomly sampled judge), while originality, feasibility, and clarity metrics are combined for integrated evaluation. e. Following Guilford's creativity theory, the evaluation methodology assesses five critical dimensions: originality, feasibility, clarity, fluency, and flexibility, with flexibility computed as the 30th percentile of the averaged scores across the other four dimensions. f. The LiveIdeaBench benchmark provides a comprehensive dataset of generated ideas, evaluation metrics, and a dynamic leaderboard tracking the performance of over 40 models, available at https://huggingface.co/datasets/6cf/liveideabench-v2 and https://liveideabench.com/.
  • Figure 1: Distribution of idea lengths measured in words
  • Figure 2: Performance comparison of models evaluated on LiveIdeaBench. a. Dimensional scores (originality, feasibility, fluency, clarity, and flexibility) and overall performance (red line) for open-weight and proprietary models, with 95% confidence intervals. b. Multidimensional performance profiles of representative models across the five evaluation dimensions. c. Word cloud visualization of scientific keywords. For detailed scores and 95% CIs for each model, see Supplementary Table S.3.
  • Figure 3: Model Performance on LiveIdeaBench Across Scientific Categories. The heatmap displays average performance scores with 95% confidence intervals for model-discipline combinations. Scientific categories were classified using SciBERT [beltagy-etal-2019-scibert] through semantic similarity computation, following the framework from [cohen2021boundary]. Higher scores (darker blue) indicate better idea generation ability within each discipline. Numbers in parentheses following each scientific category indicate the keyword count associated with that discipline. Categories are sorted by keyword count.
  • Figure 3: Distribution of scores for ideas generated by all models
  • ...and 20 more figures
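
The Figure 1 caption describes flexibility as the 30th percentile of scores averaged across the other four dimensions. The sketch below illustrates one reading of that sentence. It rests on assumptions the caption does not state: that the averaging is done per keyword for a single model, that the percentile is then taken over those per-keyword averages, and that linear interpolation is used. The function names `percentile` and `flexibility_score` and the input layout are hypothetical and not taken from the benchmark's code.

```python
from statistics import mean

# Dimensions averaged before the percentile is taken (per the Figure 1 caption).
DIMENSIONS = ("originality", "feasibility", "clarity", "fluency")


def percentile(values, q):
    """Return the q-th percentile (0-100) of a non-empty list.

    Uses linear interpolation between closest ranks; the interpolation
    rule is an assumption, since the caption does not specify one.
    """
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def flexibility_score(per_keyword_scores, q=30):
    """Compute a flexibility-style score for one model.

    per_keyword_scores: list of dicts, one per keyword, each holding a
    numeric score for every name in DIMENSIONS (hypothetical layout).
    """
    # Average the four dimension scores within each keyword.
    averaged = [mean(s[d] for d in DIMENSIONS) for s in per_keyword_scores]
    # A low percentile rewards models that stay strong on their weaker keywords.
    return percentile(averaged, q)
```

Under this reading, a model scores well on flexibility only if its per-keyword averages are high even near the bottom of its distribution, which matches the intuition of performing consistently across diverse topics.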