MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine

Kai Zhang; Zhengqing Yuan; Cheng Peng; Songlin Zhao; Mengxian Lyu; Ziyi Chen; Yanfang Ye; Wei Liu; Ying Zhang; Kaleb E Smith; Lifang He; Lichao Sun; Yonghui Wu

TL;DR

This work introduces MEDGPT-OSS, an open-weight, 20B-parameter generalist vision-language model designed to facilitate open research in clinical AI, which successfully outperforms larger open medical models on out-of-distribution multimodal reasoning and complex text-only clinical tasks.

Abstract

Biomedical multimodal assistants have the potential to unify radiology, pathology, and clinical-text reasoning, yet a critical deployment gap remains: top-performing systems are either closed-source or computationally prohibitive, precluding the on-premises deployment required for patient privacy and PHI compliance. We introduce MEDGPT-OSS, an open-weight, 20B-parameter generalist vision-language model designed to facilitate open research in clinical AI. Rather than relying on architectural complexity, MEDGPT-OSS pairs the GPT-oss language backbone with a visual front-end via an optimized, three-stage training curriculum. By progressively domain-adapting these modules through rigorous data curation and long-context multimodal alignment, we demonstrate that a 20B model can bridge the capacity gap: it outperforms larger open medical models on out-of-distribution (OOD) multimodal reasoning and complex text-only clinical tasks. By unifying diverse modalities under a single instruction-following interface, MEDGPT-OSS maintains a parameter-efficient footprint fully compatible with commodity GPUs. We release the complete training recipe, open-weight checkpoints, and a rigorous evaluation harness to serve as a verifiable foundation for privacy-preserving, institution-specific clinical AI research.

Paper Structure (33 sections, 2 figures, 6 tables)

Introduction
Related Work
Specialist pipelines and their limits.
Generalist biomedical VLMs and instruction-tuned assistants.
Frontier trends: recipes, long-context grounding, and evaluation.
Positioning of MedGPT-oss.
MedGPT-oss
Model Architecture
Visual Encoder.
Projection Module.
LLM Backbone.
Training Strategies
Pretraining.
Mid-training.
Instruction-tuning.
...and 18 more sections

Figures (2)

Figure 1: Preliminary evaluation of visual encoders on medical multimodal benchmarks. As an initial investigation, we compared the vanilla CLIP backbone against SigLIP and the domain-specific alternatives BiomedCLIP and MedSigLIP. All models use a GPT-oss-20B backbone trained via LoRA with the two-stage vanilla LLaVA recipe: we first pretrain the projector, then fine-tune the projector and the LLM backbone.
Figure 2: Evaluation of multi-view and longitudinal chest X-ray report generation on the MIMIC-CXR benchmark. Performance is measured across three clinically grounded metrics.
