Bridging the Reproducibility Divide: Open Source Software's Role in Standardizing Healthcare AI

John Wu; Zhenbang Wu; Jimeng Sun

TL;DR

Tackling reproducibility challenges in healthcare AI through open-source development is essential for ensuring that AI models are safe, effective, and beneficial for patient care, and it will help build more trustworthy AI systems that can be integrated into healthcare settings.

Abstract

Our analysis of recent AI for healthcare (AI4H) publications reveals that, despite a trend toward using open datasets and sharing modeling code, 74% of AI4H papers still rely on private datasets or do not share their code. This is especially concerning in healthcare applications, where trust is essential. Furthermore, inconsistent and poorly documented data preprocessing pipelines lead to variable reports of model performance, even for identical tasks and datasets, making it difficult to evaluate the true effectiveness of AI models. Despite the challenges posed by the reproducibility crisis, addressing these issues through open practices offers substantial benefits. For instance, while a reproducibility mandate adds extra effort to research and publication, it significantly enhances the impact of the work. Our analysis shows that papers that both used public datasets and shared code received, on average, 110% more citations than those that did neither, more than doubling the citation count. Given these clear benefits, it is imperative for the AI4H community to take concrete steps to overcome existing barriers. The community should promote open science practices, establish standardized guidelines for data preprocessing, and develop robust benchmarks. Tackling these challenges through open-source development can improve reproducibility, which is essential for ensuring that AI models are safe, effective, and beneficial for patient care. This approach will help build more trustworthy AI systems that can be integrated into healthcare settings, ultimately contributing to better patient outcomes and advancing the field of medicine.

Paper Structure (28 sections, 10 figures, 3 tables)

Introduction
Technical Reproducibility Is a Cornerstone in AI4H
The AI4H Reproducibility Crisis
Private Datasets
Proprietary Code
Lack of Standardization in Data Processing Pipelines
Error Rates of Noisy Analysis
Existing Solutions Towards Improving Reproducibility
Guidelines and Standards
Automation and Tooling
Create Open-source Software and Benchmarks for Enhancing Reproducibility in AI4H
Open-Source Software and Benchmarks
Incentivizing Reproducibility
Lowering Barriers to Contribution
Celebrating Success in Reproducibility
...and 13 more sections

Figures (10)

Figure 1: Paper Data Collection and Analysis: (a) Paper scraping details: We scrape PDF files directly from conference webpages, extracting and cleaning key paper details such as titles, authors, abstracts, and emails. We also query BioC to retrieve, in XML format, papers from PubMed's Open Access Database that contain the terms AI, deep learning, machine learning, healthcare, electronic health records, and electronic medical records. (b) Stacked bar chart of the total number of papers scraped each year, showing a steady increase in paper counts over time. (c) Key analyses performed: mapping citation counts to each paper using scholarly services such as SerpAPI [serpapi2024], Semantic Scholar [Kinney2023TheSS_Semantic_scholar], and PMIDcite [gusenbauer2020academic_pmid_cite]; scraping affiliation details by classifying emails or affiliations sourced from PubMed's MEDLINE API [gusenbauer2020academic_pmid_cite]; classifying each paper's topic from its abstract and title with a medically fine-tuned large language model (OpenBioLLM-70b); checking for code sharing by identifying specific code keywords (e.g., GitHub, Zenodo, Colab, GitLab) in each paper's main text (excluding the references); and assessing public dataset usage by checking for keywords of well-known datasets in the main text and cross-referencing each paper with the PapersWithCode API. Further detail is given in the Methodology Details appendix.
Figure 2: Trends in Public Dataset Usage: (a) Distribution of public dataset usage over time and across venues: MIMIC is the most commonly used or mentioned public dataset; private datasets dominate overall each year; conference papers use more public datasets than journal papers. (b) Rate of public dataset usage across topics: Biosignal papers use the most public datasets. (c) Rate of public dataset usage across affiliations: Industry surprisingly uses the most public datasets. (d) Distribution of citation counts for papers with and without public datasets over time: Papers using public datasets have higher citation counts on average every year. The maximum number of citations on the y-axis is capped at 50 to focus on non-outlier behavior. (e) Cumulative distribution of citation counts by public dataset usage: Papers that use public datasets tend to have greater citation counts, regardless of outlier status.
Figure 3: Trends in Code Sharing: (a) Code-sharing percentage across different venues over time. Conference papers share code significantly more often than other PubMed papers. (b) Rate of code sharing by topic: Biomedicine papers share code slightly more often. (c) Rate of code sharing by affiliation: Industry shares slightly less than others. (d) Distribution of citation counts for papers with and without code over time: Papers that share code generally have at least the same number of citations or more each year. Citation counts are limited to 50 or below to better visualize typical citation numbers beyond outlier papers. (e) Cumulative distribution of code sharing vs. no code sharing: Papers that share code overall have more citations, regardless of outlier status.
Figure 4: Prompt for standardizing noisily extracted conference paper titles.
Figure 5: Prompt template for extracting and cleaning email addresses from author details using a large language model. The {text} placeholder is the author-detail text extracted from the noisy PDF files.
...and 5 more figures
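
The code-sharing check described in the Figure 1 caption (keyword matching over the main text, excluding the references) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The keyword tuple contains only the four examples the caption names, and the rule for locating the reference list (the last line consisting solely of "References" or "Bibliography") is an assumption made for illustration; the function names are hypothetical.

```python
import re

# Only the four example keywords named in the Figure 1 caption;
# the full list used in the paper is not given here.
CODE_KEYWORDS = ("github", "zenodo", "colab", "gitlab")

# Assumption: the reference list begins at the last line that consists
# solely of a heading such as "References" or "Bibliography".
_REFERENCES_HEADING = re.compile(
    r"^\s*(references|bibliography)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text: str) -> str:
    """Return the text before the last references heading, if one exists."""
    matches = list(_REFERENCES_HEADING.finditer(text))
    if not matches:
        return text
    return text[: matches[-1].start()]


def shares_code(text: str, keywords=CODE_KEYWORDS) -> bool:
    """Flag a paper as sharing code if any keyword appears in its main text."""
    body = strip_references(text).lower()
    return any(re.search(rf"\b{re.escape(k)}\b", body) for k in keywords)


# Hypothetical example inputs.
paper_with_code = (
    "Introduction\n"
    "Our code is available at https://github.com/example/repo.\n"
    "References\n"
    "[1] Some cited work."
)
paper_without_code = (
    "Introduction\n"
    "We train a model on a private dataset.\n"
    "References\n"
    "[1] Another work, code at github.com/other/repo."
)

print(shares_code(paper_with_code))
print(shares_code(paper_without_code))
```

Excluding the references matters because a cited paper's repository link would otherwise be counted as the citing paper's own code, as the second example shows.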
