Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu
TL;DR
Hulu-Med addresses the need for holistic clinical multimodal understanding by unifying text, 2D/3D imaging, and video within a single, transparent architecture. It combines a patch-based, rotary-position-adaptive visual encoder, whose 2D RoPE is extended to 3D volumes and video, with a multimodal projector and an LLM decoder. Training follows a three-stage progressive regime augmented with synthetic data, and a medical-aware token-reduction strategy prunes up to 55% of visual tokens for 3D and video inputs. The formal objective combines the visual and textual streams into a single autoregressive model, $y = \Phi([g(f_v(\mathbf{v})); f_t(\mathbf{t})])$, which supports both text-only and multimodal generation. The model is trained on a 16.7M-sample corpus of public and synthetic data spanning 12 anatomical systems and 14 imaging modalities, with full data and code release to ensure reproducibility. Empirically, Hulu-Med surpasses existing open-source models on 27 of 30 medical benchmarks, which cover 2D/3D visual question answering, report generation, multilingual dialogue, and rare-disease diagnosis, and it outperforms proprietary systems such as GPT-4o on 16 of them. This work provides a scalable, transparent blueprint for holistic medical VLMs and demonstrates the practicality of end-to-end medical AI built on public data, with broad clinical potential.
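The composition in the formula above can be illustrated with a toy sketch. It assumes, reading the formula against the components listed in this summary, that $f_v$ is the visual encoder, $g$ the multimodal projector, $f_t$ the text embedding, and $\Phi$ the LLM decoder. All sizes, weights, and function bodies below are hypothetical stand-ins chosen for illustration and are not taken from the Hulu-Med code; only the wiring (visual tokens projected and concatenated in front of text tokens, with the visual branch optional) mirrors the formula.

```python
import numpy as np

# Hypothetical toy dimensions; not values from the paper.
D_VIS, D_LLM, VOCAB = 8, 16, 32

_rng = np.random.default_rng(0)
W_vis = _rng.normal(size=(D_VIS, D_VIS))   # stand-in weights for f_v
W_proj = _rng.normal(size=(D_VIS, D_LLM))  # stand-in weights for g
E_txt = _rng.normal(size=(VOCAB, D_LLM))   # stand-in embedding table for f_t
W_out = _rng.normal(size=(D_LLM, VOCAB))   # stand-in output head for Phi


def f_v(patches):
    """Toy visual encoder: (n_patches, D_VIS) -> (n_patches, D_VIS)."""
    return np.tanh(patches @ W_vis)


def g(visual_feats):
    """Toy projector: maps visual features into the LLM embedding space."""
    return visual_feats @ W_proj


def f_t(token_ids):
    """Toy text embedding: looks up one D_LLM vector per token id."""
    return E_txt[np.asarray(token_ids, dtype=int)]


def build_sequence(token_ids, patches=None):
    """Form [g(f_v(v)); f_t(t)]; with no image, only f_t(t) is used."""
    text = f_t(token_ids)
    if patches is None:
        return text
    return np.concatenate([g(f_v(patches)), text], axis=0)


def phi(sequence):
    """Toy decoder: mean-pools the sequence and returns next-token logits."""
    return sequence.mean(axis=0) @ W_out


def next_token(token_ids, patches=None):
    """One greedy decoding step over the combined sequence."""
    return int(np.argmax(phi(build_sequence(token_ids, patches))))
```

Calling `next_token([1, 2, 3])` exercises the text-only path, while `next_token([1, 2, 3], patches)` with a `(n_patches, 8)` array prepends the projected visual tokens. A real decoder would attend over the sequence rather than mean-pool it; the pooling here only keeps the sketch short.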
Abstract
Real-world clinical decision-making requires integrating heterogeneous data, including medical text, 2D images, 3D volumes, and videos, yet existing AI systems fail to unify all these signals, which limits their utility. In this paper, we introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM) designed to unify language-only, 2D/3D vision-language, and video understanding within a single architecture. Hulu-Med is trained on a curated corpus of 16.7 million samples, comprising exclusively public or synthetic data, spanning 12 major anatomical systems and 14 medical imaging modalities. Hulu-Med employs a medical-aware token-reduction strategy that prunes redundant visual tokens, achieving up to a 55% reduction for 3D and video inputs, improving cross-modal efficiency, and enabling training at 7B-32B parameter scales in approximately 4,000-40,000 GPU hours. Across 30 public in-domain and out-of-domain medical benchmarks (covering text reasoning, visual question answering, report generation, multilingual dialogue, video understanding, and rare disease diagnosis), Hulu-Med surpasses existing open-source models on 27 of 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 benchmarks. Despite being a VLM, Hulu-Med outperforms GPT-4o and matches GPT-o1 on the text-only HealthBench. For the first time in the community, we provide a fully transparent, reproducible, and cost-effective pipeline for holistic medical vision-language understanding by releasing our end-to-end data curation, training procedures, and model parameters. Code and models are available at https://github.com/ZJUI-AI4H/Hulu-Med.
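The abstract does not spell out how the medical-aware token reduction works, so the following is only a generic illustration of pruning redundant visual tokens in 3D or video inputs, not the authors' method. It assumes tokens are arranged as (slices or frames, patches, feature dimension) and drops a patch token when its cosine similarity to the token at the same patch position in the previous slice reaches a threshold. The function name, the similarity criterion, and the threshold value are all hypothetical.

```python
import numpy as np


def prune_redundant_tokens(tokens, threshold=0.95):
    """Drop patch tokens that nearly duplicate the previous slice.

    tokens: array of shape (n_slices, n_patches, dim).
    Returns (kept_tokens, keep_mask), where keep_mask has shape
    (n_slices, n_patches) and kept_tokens has shape (n_kept, dim).
    Assumption: redundancy is measured by cosine similarity between
    the same patch position in adjacent slices (illustrative only).
    """
    tokens = np.asarray(tokens, dtype=float)
    n_slices, n_patches, _ = tokens.shape

    # Normalize each token so a dot product gives cosine similarity.
    norms = np.linalg.norm(tokens, axis=-1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-12)

    # The first slice is always kept in full.
    keep = np.ones((n_slices, n_patches), dtype=bool)

    # Similarity of each slice to its predecessor, per patch position.
    sim = np.sum(unit[1:] * unit[:-1], axis=-1)
    keep[1:] = sim < threshold

    return tokens[keep], keep
```

For a volume whose second slice repeats the first and whose third slice differs, the mask keeps the first and third slices and drops the second, so one third of the tokens are removed. How close such a scheme comes to the reported 55% reduction depends entirely on the data and on the actual criterion used in the paper.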
