Literature Meets Data: A Synergistic Approach to Hypothesis Generation

Haokun Liu, Yangqiaoyu Zhou, Mingxuan Li, Chenfei Yuan, Chenhao Tan

TL;DR

This work tackles hypothesis generation by uniting literature-based insights with data-driven patterns through a multi-agent LLM framework. It introduces HypoGeniC for data-driven hypotheses and HypoRefine for literature-guided refinement, plus a union strategy to merge both sources while eliminating redundancy. Extensive automatic (OOD/IND, cross-model) and human evaluations across deception detection, AIGC detection, mental stress detection, and persuasive argument prediction show that the literature+data approach consistently outperforms baselines and enhances human decision-making. The results suggest that integrating knowledge from literature with empirical data yields more generalizable, useful hypotheses, enabling faster and more responsible scientific inquiry.
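
The union step described above (merge both hypothesis sources, drop redundant entries) can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the function names, the token-overlap (Jaccard) similarity, and the 0.6 threshold are all assumptions made for the example, and the actual framework may well use an LLM-based redundancy check instead.

```python
import re


def tokens(text):
    """Lowercased word set used for a crude similarity check (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two hypothesis strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def union_hypotheses(literature, data_driven, threshold=0.6):
    """Merge two hypothesis lists, skipping near-duplicates.

    `threshold` is a hypothetical cutoff: a candidate is kept only if its
    similarity to every already-kept hypothesis is below it.
    """
    merged = []
    for hyp in list(literature) + list(data_driven):
        if all(jaccard(hyp, kept) < threshold for kept in merged):
            merged.append(hyp)
    return merged


# Illustrative inputs only; these are not hypotheses from the paper.
literature = [
    "Deceptive reviews use more superlatives.",
    "Truthful reviews mention specific details.",
]
data_driven = [
    "Deceptive reviews use more superlatives!",
    "Reviews with many first-person pronouns are deceptive.",
]
merged = union_hypotheses(literature, data_driven)
```

Literature-based hypotheses are visited first here, so when two entries overlap the literature version is the one retained; that ordering is a design choice of this sketch rather than something the source specifies.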

Abstract

AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method to five different datasets and demonstrate that integrating literature and data outperforms other baselines (8.97% over few-shot, 15.75% over literature-based alone, and 3.37% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI-generated content detection. Our results show that human accuracy improves significantly by 7.44% and 14.19% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.

Paper Structure

This paper contains 63 sections, 2 equations, 9 figures, 17 tables.

Figures (9)

  • Figure 1: The literature-based approach (A) leverages existing human knowledge to generate hypotheses but struggles to adapt to new data and lacks empirical grounding. The data-driven approach (B) relies on large datasets to generate hypotheses, enabling adaptation to diverse scenarios but risking overfitting. The literature + data approach (C) combines the strengths of both, grounding hypotheses in human knowledge while incorporating empirical data to enhance adaptiveness. See algorithmic details in the Methods section.
  • Figure 2: Instruction page for Likert rating.
  • Figure 3: Annotation page for Likert rating.
  • Figure 4: Instruction page for novelty check.
  • Figure 5: Annotation page for novelty check.
  • ...and 4 more figures