BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks

Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Zhengliang Liu, Xun Chen, Brian D. Davison, Hui Ren, Jing Huang, Chen Chen, Yuyin Zhou, Sunyang Fu, Wei Liu, Tianming Liu, Xiang Li, Yong Chen, Lifang He, James Zou, Quanzheng Li, Hongfang Liu, Lichao Sun

TL;DR

The paper presents BiomedGPT, described as the first open-source and lightweight vision–language foundation model designed as a generalist for diverse biomedical tasks. It achieves state-of-the-art accuracy in 16 out of 25 biomedical tasks and shows promising performance in a series of potential clinical applications.

Abstract

Traditional biomedical artificial intelligence (AI) models, designed for specific tasks or modalities, often exhibit limited flexibility in real-world deployment and struggle to utilize holistic information. Generalist AI holds the potential to address these limitations due to its versatility in interpreting different data types and generating tailored outputs for diverse needs. However, existing biomedical generalist AI solutions are typically heavyweight and closed source to researchers, practitioners, and patients. Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model, designed as a generalist capable of performing various biomedical tasks. BiomedGPT achieved state-of-the-art results in 16 out of 25 experiments while maintaining a computing-friendly model scale. We also conducted human evaluations to assess the capabilities of BiomedGPT in radiology visual question answering, report generation, and summarization. BiomedGPT exhibits robust prediction ability with a low error rate of 3.8% in question answering, satisfactory performance with an error rate of 8.3% in writing complex radiology reports, and competitive summarization ability with a nearly equivalent preference score to human experts. Our method demonstrates that effective training with diverse data can lead to more practical biomedical AI for improving diagnosis and workflow efficiency.

Paper Structure

This paper contains 66 sections, 7 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: BiomedGPT can process diverse modalities and perform versatile tasks. (a) BiomedGPT primarily focuses on visual and textual inputs but can also process tabular data through serialization. (b) Examples of the supported downstream vision-language tasks demonstrate BiomedGPT's versatility. Additional tasks can be incorporated to meet further clinical needs via lightweight, task-specific fine-tuning. (c) Examples of clinically relevant use cases, in which the input may be an image with text or text only and the model responds to queries ("Q") by generating responses ("A"). Thanks to its unified framework design and comprehensive pretraining on biomedical data, BiomedGPT is highly adaptable and can be applied to a variety of downstream tasks.
  • Figure 2: The overview of BiomedGPT: workflow, performance, and pre-training datasets. (a) Graphical illustration of how BiomedGPT handles multimodal inputs and performs diverse downstream tasks. The expected form of output for each task is determined by feeding the specific instruction to the model. (b) Comparative performance analysis: this figure contrasts the achievements of BiomedGPT with prior SOTA results and Med-PaLM M (12B). The evaluation metrics include: accuracy for image classification, medical language inference, and visual question answering (VQA) (benchmarked against SOTA results); CIDEr for image captioning; ROUGE-L for text summarization; weighted F1 scores for VQA (in comparison with Med-PaLM M); and F1-macro for breast mass and
  • Figure 3: BiomedGPT performs fine-tuning for vision-language and medical image classification downstream tasks. (a) Medical VQA performance of BiomedGPT and the leading models in terms of closed-ended and open-ended accuracies. (b) Image captioning performance of BiomedGPT and SOTA models on IU X-ray, PEIR Gross and MIMIC-CXR data. The evaluation metrics are ROUGE-L, METEOR and CIDEr. (c) Evaluation of image classification on the MedMNIST-Raw dataset for each domain type. (d) Image classification performance with accuracy across two super-resolution image datasets. (e) Image classification performance with F1-macro on the CBIS-DDSM dataset. (f) Accuracy across 9 datasets of different resolutions, compared across model scales. In general, larger models tend to perform better.
  • Figure 4: BiomedGPT performs few-epoch transfer learning for clinical text understanding and summarization and generates the response via zero-shot transfer learning. (a) Evaluation of models for the treatment suggestion task in terms
  • Figure 5: Human evaluation of the VQA, text summarization, and captioning tasks. (a) Examples of human evaluation for three tasks in terms of response factuality, omissions, and significance of the errors. (b) Performance comparison
  • ...and 12 more figures
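
The Figure 1 caption notes that tabular data can be processed through serialization, that is, by turning a table row into text the model can read. The page does not say which template the authors use, so the following is only a minimal sketch under an assumed "column is value" template; the function name `serialize_row` and the example record are hypothetical and not taken from the paper.

```python
def serialize_row(row: dict[str, object]) -> str:
    """Turn one table row into a single line of plain text.

    Assumed template (not from the paper): "<column> is <value>",
    with fields joined by "; " and a closing period.
    """
    parts = [f"{column} is {value}" for column, value in row.items()]
    return "; ".join(parts) + "."


# Made-up record, for illustration only.
record = {"age": 63, "sex": "female", "smoker": "no"}
print(serialize_row(record))  # age is 63; sex is female; smoker is no.
```

The resulting string could then be passed to a text-only or image-plus-text prompt like any other textual input; how the actual model formats such inputs is not stated here.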