ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine

Junying Chen, Zhenyang Cai, Zhiheng Liu, Yunjin Yang, Rongsheng Wang, Qingying Xiao, Xiangyi Feng, Zhan Su, Jing Guo, Xiang Wan, Guangjun Yu, Haizhou Li, Benyou Wang

TL;DR

ShizhenGPT introduces the first multimodal LLM tailored for Traditional Chinese Medicine, addressing critical data scarcity and the need to integrate visual, auditory, olfactory, and pulse-based diagnostics. The authors curate an expansive TCM-focused dataset and employ a two-stage pretraining plus instruction-tuning pipeline, using LLM backbones augmented with vision and signal encoders. Empirical results show strong TCM expertise, leading performance on TCM-vision tasks, and robust multimodal perception across modalities, with competitive results against larger proprietary models. The work provides public datasets, benchmarks, and code to catalyze future research in holistic AI for TCM diagnostics.

Abstract

Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.

Paper Structure

This paper contains 67 sections, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Key Capabilities of ShizhenGPT, a multimodal LLM for Traditional Chinese Medicine (TCM).
  • Figure 2: Overview of ShizhenGPT. (a) Pre-training process, with the loss curve of ShizhenGPT-7B. (b) Post-training process with multimodal instruction tuning. (c) Model architecture. (d) Demonstration of ShizhenGPT's capabilities.
  • Figure 3: Results of the human evaluation. ShizhenGPT refers to ShizhenGPT-32B. "Win/Tie/Loss" indicates the proportion of expert preferences for the model responses (details are in the appendix).
  • Figure 4: Example responses from ShizhenGPT-32B. Full outputs are provided in the appendix.
  • Figure 5: The prompt for rating the quality of TCM documents. Here, {TCM Paragraph} represents the TCM text to be scored.
  • ...and 8 more figures