Sparks of Science: Hypothesis Generation Using Structured Paper Data

Charles O'Neill, Tirthankar Ghosal, Roberta Răileanu, Mike Walmsley, Thang Bui, Kevin Schawinski, Ioana Ciucă

TL;DR

This work presents HypoGen, a dataset of ~5500 structured problem-hypothesis pairs built on the Bit-Flip-Spark formalism plus an explicit Chain-of-Reasoning to mirror scientific ideation. It frames hypothesis generation as conditional language modeling and demonstrates that fine-tuning LLaMA-based models on HypoGen improves hypothesis quality, particularly in feasibility and domain alignment, as evaluated by automated metrics, LLM judges, and limited human studies. The paper provides a detailed preprocessing, dataset construction, and fine-tuning/inference pipeline, and releases HypoGen publicly to foster AI-assisted scientific discovery. It also discusses limitations of relying on LLM-based evaluation and outlines future directions for cross-domain generalization and more robust validation methods.

Abstract

Generating novel and creative scientific hypotheses is a cornerstone of achieving Artificial General Intelligence. Large language and reasoning models have the potential to aid in the systematic creation, selection, and validation of scientifically informed hypotheses. However, current foundation models often struggle to produce scientific ideas that are both novel and feasible. One reason is the lack of a dedicated dataset that frames Scientific Hypothesis Generation (SHG) as a Natural Language Generation (NLG) task. In this paper, we introduce HypoGen, the first dataset of approximately 5500 structured problem-hypothesis pairs extracted from top-tier computer science conferences and structured with a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Spark is the key insight or conceptual leap, and the Flip is the resulting counterproposal. HypoGen uniquely integrates an explicit Chain-of-Reasoning component that reflects the intellectual process from Bit to Flip. We demonstrate that framing hypothesis generation as conditional language modelling, where the model is fine-tuned on Bit-Flip-Spark and the Chain-of-Reasoning and receives only the Bit at inference, leads to improvements in the overall quality of the hypotheses. Our evaluation employs automated metrics and LLM judge rankings for overall quality assessment. We show that fine-tuning on our HypoGen dataset improves the novelty, feasibility, and overall quality of the generated hypotheses. The HypoGen dataset is publicly available at huggingface.co/datasets/UniverseTBD/hypogen-dr1.
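
To make the conditional language modelling framing concrete, the following is a minimal sketch of how one Bit-Flip-Spark record might be turned into a prompt/completion pair for fine-tuning, with only the Bit supplied at inference. The class name, field names, prompt labels, and the ordering of the target (Spark, then Chain-of-Reasoning, then Flip) are assumptions made for illustration; they are not taken from the released dataset schema or the authors' training code.

```python
from dataclasses import dataclass


@dataclass
class HypoGenRecord:
    """One structured problem-hypothesis pair.

    Field names are hypothetical and may not match the released dataset.
    """
    bit: str                 # conventional assumption
    spark: str               # key insight or conceptual leap
    chain_of_reasoning: str  # reasoning path from Bit to Flip
    flip: str                # resulting counterproposal


def build_prompt(bit: str) -> str:
    """Conditioning text: only the Bit is given to the model."""
    return f"Bit: {bit}\n"


def build_training_example(record: HypoGenRecord) -> dict:
    """Pair the Bit-only prompt with the text the model should generate.

    The target ordering here is an assumption, not the authors' format.
    """
    completion = (
        f"Spark: {record.spark}\n"
        f"Chain-of-Reasoning: {record.chain_of_reasoning}\n"
        f"Flip: {record.flip}"
    )
    return {"prompt": build_prompt(record.bit), "completion": completion}


if __name__ == "__main__":
    # Placeholder strings only; no real dataset content is reproduced here.
    record = HypoGenRecord(
        bit="<conventional assumption>",
        spark="<key insight>",
        chain_of_reasoning="<reasoning from Bit to Flip>",
        flip="<counterproposal>",
    )
    example = build_training_example(record)
    print(example["prompt"] + example["completion"])
```

At inference time, under the same assumptions, only `build_prompt(bit)` would be passed to the fine-tuned model, which is then expected to produce the Spark, the Chain-of-Reasoning, and the Flip.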

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: The HypoGen process begins with input paper abstracts, from which the structured Bit (the problem), Flip (the solution) and Spark (key insight) are extracted by OpenAI's o1 model. The Chain of Reasoning is extracted by the o1 model from the main body of the paper. These outputs are used to fine-tune a LLaMA-based model, which then generates hypotheses from the provided Bit. A judge module (Claude 3.7 Sonnet) assesses the overall quality based on novelty and feasibility.
  • Figure 2: Comparative analysis of the quality of generated hypotheses across nine experiments, as evaluated by the LLM judge Claude 3.7 Sonnet. Upper: Win rates comparing non-fine-tuned versus fine-tuned LLaMA 3.1-8B (LLaMA-8B-FT) and R1-distilled-LLaMA-3.1-8B (R1-distilled-8B-FT) models on novelty and feasibility, showing a consistent trade-off: fine-tuned models excel at feasibility (74-86% win rate), while non-fine-tuned variants show greater novelty (54-86% win rate). Lower: Pairwise win rate heatmap (read on the horizontal) between human experts, fine-tuned models (LLaMA-8B-FT, R1-FT), and one-shot models (O1-1shot, LLaMA-8B-1shot, R1-1shot) across novelty, feasibility, and overall quality dimensions. Human hypotheses are the overall winners (82-90% win rate), with fine-tuned models achieving comparable feasibility scores (62-64% vs Human). The fine-tuned models perform better than their one-shot counterparts in overall quality (86-92% win rate).
  • Figure 3: Comparative analysis of the quality of generated hypotheses across nine experiments, as evaluated by the LLM judge o3-mini.
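
The pairwise win rates reported in these figures can be illustrated with a short sketch. It assumes the judge returns one winner per head-to-head comparison and that ties do not occur; the function name, the verdict format, and the system labels are hypothetical and are not taken from the paper's evaluation code.

```python
from collections import defaultdict


def pairwise_win_rates(verdicts):
    """Compute the win rate of each system against each opponent.

    `verdicts` is an iterable of (system_a, system_b, winner) tuples, where
    `winner` must equal system_a or system_b (ties are assumed not to occur).
    Returns a dict mapping (row_system, column_system) to the fraction of
    their comparisons that row_system won.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for system_a, system_b, winner in verdicts:
        if winner not in (system_a, system_b):
            raise ValueError(f"winner {winner!r} is not one of the compared systems")
        loser = system_b if winner == system_a else system_a
        wins[(winner, loser)] += 1
        # Each comparison counts toward both orderings of the pair.
        totals[(system_a, system_b)] += 1
        totals[(system_b, system_a)] += 1
    return {pair: wins[pair] / count for pair, count in totals.items()}


if __name__ == "__main__":
    # Toy verdicts with placeholder labels; these are not results from the paper.
    toy_verdicts = [
        ("system_a", "system_b", "system_a"),
        ("system_a", "system_b", "system_a"),
        ("system_a", "system_b", "system_b"),
        ("system_b", "system_a", "system_a"),
    ]
    for pair, rate in sorted(pairwise_win_rates(toy_verdicts).items()):
        print(pair, f"{rate:.2f}")
```

Running one such computation per criterion (novelty, feasibility, overall quality) would give one matrix per heatmap panel, with each row read against its column opponents.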