Make Literature-Based Discovery Great Again through Reproducible Pipelines

Bojan Cestnik, Andrej Kastrin, Boshko Koloski, Nada Lavrač

TL;DR

This paper advances literature-based discovery by combining bisociative reasoning with a strong emphasis on reproducible science. It introduces open, dockerized Jupyter Notebook pipelines that implement traditional and novel LBD approaches and makes benchmark datasets (RS-DFO, Mig-Mg, Aut-CaN) freely available for replication. The work spans closed and open discovery methods, including ensemble bridging, outlier-based strategies, network-based link prediction, and an LLM-driven AHAM conceptualization of the LBD field. By providing end-to-end reproducible workflows and tutorials, it aims to reduce barriers to replication, foster collaboration, and accelerate robust, cross-domain scientific discoveries.

Abstract

By connecting disparate sources of scientific literature, literature-based discovery (LBD) methods help to uncover new knowledge and generate new research hypotheses that cannot be found from domain-specific documents alone. Our work focuses on bisociative LBD methods that combine bisociative reasoning with LBD techniques. The paper presents LBD through the lens of reproducible science to ensure the reproducibility of LBD experiments, overcome the inconsistent use of benchmark datasets and methods, trigger collaboration, and advance the LBD field toward more robust and impactful scientific discoveries. The main novelty of this study is a collection of Jupyter Notebooks that illustrate the steps of the bisociative LBD process, including data acquisition, text preprocessing, hypothesis formulation, and evaluation. The contributed notebooks implement a selection of traditional LBD approaches, as well as our own ensemble-based, outlier-based, and link prediction-based approaches. The reader can benefit from hands-on experience with LBD through open access to benchmark datasets, code reuse, and a ready-to-run Docker recipe that ensures reproducibility of the selected LBD methods.

Paper Structure

This paper contains 17 sections, 3 tables.