Can Large Language Models Replace Data Scientists in Biomedical Research?

Zifeng Wang, Benjamin Danek, Ziwei Yang, Zheng Chen, Jimeng Sun

TL;DR

This paper builds BioDSBench, a benchmark of 293 biomedical data science coding tasks derived from 39 published studies and run on patient-level TCGA-type data, to quantify how well LLMs handle biomedical data analysis and coding. The authors evaluate six leading LLMs with several adaptation strategies (chain-of-thought (CoT) prompting, few-shot prompting, AutoPrompt, retrieval-augmented generation (RAG), and self-reflection), deploy a sandbox platform for human-AI collaboration, and run a user study with five medical researchers. They find that vanilla prompting is insufficient, that CoT yields large gains, that self-reflection yields additional improvements, and that RAG and AutoPrompt offer limited benefit. Overall, LLMs cannot fully automate biomedical data science tasks, but they can streamline the workflow when paired with expert users. The work offers practical insights for designing AI-assisted data science tools in biomedicine and highlights open safety, data privacy, and scalability considerations.

Abstract

Data science plays a critical role in biomedical research, but it requires professionals with expertise in coding and medical data analysis. Large language models (LLMs) have shown great potential in supporting medical tasks and perform well in general coding tests. However, existing evaluations fail to assess their capability in biomedical data science, particularly in handling diverse data types such as genomics and clinical datasets. To address this gap, we developed a benchmark of data science coding tasks derived from the analyses of 39 published studies. This benchmark comprises 293 coding tasks (128 in Python and 165 in R) performed on real-world TCGA-type genomics and clinical data. Our findings reveal that vanilla prompting of LLMs yields suboptimal performance due to drawbacks in following input instructions, understanding target data, and adhering to standard analysis practices. Next, we benchmarked six cutting-edge LLMs and advanced adaptation methods, finding two methods to be particularly effective: chain-of-thought prompting, which provides a step-by-step plan for data analysis and led to a 21 percentage-point improvement in code accuracy (56.6% versus 35.3%); and self-reflection, which enables LLMs to refine buggy code iteratively and yielded an 11 percentage-point improvement in code accuracy (45.5% versus 34.3%). Building on these insights, we developed a platform that integrates LLMs into the data science workflow for medical professionals. In a user study with five medical professionals, we found that while LLMs cannot fully automate programming tasks, they significantly streamline the programming process. We found that 80% of their submitted code solutions were incorporated from LLM-generated code, with up to 96% reuse in some cases. Our analysis highlights the potential of LLMs to enhance data science efficiency in biomedical research when integrated into expert workflows.
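The self-reflection strategy described in the abstract (generate code, run it against tests, feed the resulting logs back to the model, and repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the `llm` callable, its three-argument signature, the `toy_llm` stub, and the `max_rounds` default are all hypothetical, and `exec` is used only to keep the example short. Real LLM-generated code should run in an isolated sandbox.

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical interface: (question, previous_code, error_log) -> new code string.
LLM = Callable[[str, Optional[str], Optional[str]], str]


def run_with_tests(code: str, test_code: str) -> Optional[str]:
    """Run candidate code, then its tests. Return None on success, else the error log."""
    namespace: dict = {}
    try:
        exec(code, namespace)       # illustration only; not a sandbox
        exec(test_code, namespace)
        return None
    except Exception:
        return traceback.format_exc()


def self_reflect(question: str, test_code: str, llm: LLM,
                 max_rounds: int = 3) -> Tuple[str, bool]:
    """Generate code, then revise it for up to max_rounds using the failure logs."""
    code = llm(question, None, None)            # initial attempt, no feedback yet
    for _ in range(max_rounds):
        log = run_with_tests(code, test_code)
        if log is None:
            return code, True                   # tests pass, stop reflecting
        code = llm(question, code, log)         # revise using the error log
    return code, run_with_tests(code, test_code) is None


def toy_llm(question: str, previous_code: Optional[str],
            error_log: Optional[str]) -> str:
    """Stub standing in for a real model: buggy first answer, fixed after feedback."""
    if previous_code is None:
        return "def mean(xs):\n    return sum(xs) / (len(xs) - 1)\n"
    return "def mean(xs):\n    return sum(xs) / len(xs)\n"
```

With the stub, `self_reflect("Compute the mean of a list", "assert mean([1, 2, 3]) == 2", toy_llm)` fails its first test run and passes after one revision. In practice the error log would also carry the runtime output and print statements that the Figure 3 caption lists as self-reflection inputs.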

Paper Structure

This paper contains 8 sections, 2 equations, 13 figures.

Figures (13)

  • Figure 1: Framework overview. a, we created a data science coding dataset based on the analyses extracted from medical publications. b, the total number of analysis tasks and studies in the testing data, which also covers a diverse set of tools and libraries. c, illustration of task complexity through the distributions of question length and answer length. d, an example of semantic lines. e, the distribution of semantic lines in the reference answers across difficulty levels. f, the selected models, adaptation methods, and coding tasks in this study.
  • Figure 1: A list of example medical publications we referred to when creating the data science coding tasks. For each study, we created five to over ten analysis tasks and categorized each task into an analysis type, such as data processing or data exploration. The analyses are performed on multimodal patient data, such as patient clinical data, clinical sample data, and mutation data. The patient data sizes vary from tens to tens of thousands.
  • Figure 2: Assessment of different models and adaptation methods in automating biomedical data science tasks. a, the inputs for LLMs to generate the code and the associated evaluation process. b, the pass@5 of three LLMs with varying temperatures across difficulty levels in the Python coding dataset. c, the proportions of the reference solution code that can be drawn directly from the LLM-generated code. d and e, the pass@1 of six LLMs across difficulty levels in the Python and R coding datasets, respectively.
  • Figure 2: An example of a Python coding task with the input question, prefix code, and test cases.
  • Figure 3: Exploration of strategic adaptations and their effectiveness. a, the inputs for LLM self-reflection are the testing logs, runtime logs, and print statements from the initial code; the output is the proposed solution. b, study-level comparison of different adaptations versus vanilla methods. c and d, the pass@1 with increasing rounds of self-reflection for Python and R tasks, respectively. e and f, the outcome classifications of code solutions before and after self-reflection for Python and R tasks, respectively. g, demonstrations of three error types.
  • ...and 8 more figures
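
The pass@1 and pass@5 values named in the figure captions are instances of the pass@k metric. A common way to compute it is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n code samples are generated per task and c of them pass all tests. Whether the authors use exactly this estimator is an assumption, and the function names and sample counts below are made up for illustration.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task: n samples generated, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs; assumes at least one task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

For example, `pass_at_k(10, 3, 1)` reduces to the fraction of correct samples, 3/10, while `pass_at_k(5, 1, 5)` is 1.0 because drawing all five samples always includes the one correct solution.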