A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine

Hanguang Xiao; Feizhong Zhou; Xingyue Liu; Tianqi Liu; Zhipeng Li; Xin Liu; Xiaoxuan Huang

TL;DR

The paper surveys the rapid rise of LLMs and MLLMs in medicine, tracing the shift from supervised and unsupervised pre-training to prompt- and data-driven paradigms while emphasizing high-quality data. It catalogs architectures (encoder/decoder/encoder-decoder) and modality-alignment strategies that enable vision-text medical reasoning, and it compiles medical datasets, fine-tuning techniques, and evaluation methods. It highlights key medical applications—diagnosis, clinical report generation, education, mental health, and surgical assistance—and candidly discusses challenges such as hallucinations, privacy, recency, and biases, offering practical directions like edge deployment, medical agents, and continual data integration. Together, these insights aim to bridge AI advances and clinical practice, fostering safer, more capable AI-enabled healthcare systems.

Abstract

Since the release of ChatGPT and GPT-4, large language models (LLMs) and multimodal large language models (MLLMs) have attracted widespread attention for their exceptional capabilities in understanding, reasoning, and generation, introducing transformative paradigms for integrating artificial intelligence into medicine. This survey provides a comprehensive overview of the development, principles, application scenarios, challenges, and future directions of LLMs and MLLMs in medicine. Specifically, it begins by examining the paradigm shift, tracing the transition from traditional models to LLMs and MLLMs and highlighting the unique advantages of these models in medical applications. Next, the survey reviews existing medical LLMs and MLLMs, providing detailed, systematic guidance on their construction and evaluation. Subsequently, to underscore the substantial value of LLMs and MLLMs in healthcare, the survey explores five promising applications in the field. Finally, the survey addresses the challenges confronting medical LLMs and MLLMs and proposes practical strategies and future directions for their integration into medicine. In summary, this survey offers a comprehensive analysis of the technical methodologies and practical clinical applications of medical LLMs and MLLMs, with the goal of bridging the gap between these advanced technologies and clinical practice, thereby fostering the evolution of the next generation of intelligent healthcare systems.

Paper Structure (39 sections, 4 equations, 10 figures, 3 tables)

Introduction
Background of LLMs and MLLMs
Supervised Learning
Unsupervised Pre-training and Fine-tuning
Unsupervised Pre-training and Prompt
Text-only to Multimodal
High-quality Data
Structure of LLMs and MLLMs
Structure of LLMs
Encoder-only
Decoder-only
Encoder-Decoder
Structure of MLLMs
Vision Encoder
LLM Backbone
...and 24 more sections

Figures (10)

Figure 1: The process of constructing and evaluating medical LLMs and MLLMs.
Figure 2: The overall structure of the survey. Sections 2 to 4 focus on the principles of medical LLMs and MLLMs; Sections 5 to 7 focus on practical clinical applications.
Figure 3: Evolution of LLMs and MLLMs. The upper section illustrates the research focuses and paradigm shifts across the evolution of these models, while the lower section highlights key milestones achieved at each stage.
Figure 4: The core modules and pipeline of MLLMs. On the far right are three types of modality alignment modules. The approach of utilizing expert models to construct MLLMs is regarded as a type of prompt augmentation method, classified under modality alignment modules for further elaboration.
Figure 5: Overview of six fine-tuning methods. In our analysis of the related work on medical LLMs and MLLMs, we found that Continuous Pre-Training (CPT) is commonly used to inject medical knowledge into LLMs and MLLMs; Instruction Fine-Tuning (IFT) enhances the models' ability to follow instructions and their zero-shot performance; Supervised Fine-Tuning is frequently employed to improve model performance on specific tasks; and Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Direct Preference Optimization (DPO) are used to align model behavior with human preferences.
...and 5 more figures
