A Survey of Medical Vision-and-Language Applications and Their Techniques

Qi Chen, Ruoshan Zhao, Sinuo Wang, Vu Minh Hieu Phan, Anton van den Hengel, Johan Verjans, Zhibin Liao, Minh-Son To, Yong Xia, Jian Chen, Yutong Xie, Qi Wu

TL;DR

This survey analyzes vision-and-language model architectures in detail, focusing on their distinct strategies for cross-modal integration and exploitation of medical visual and textual features.

Abstract

Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Their applications are versatile and have the potential to improve diagnostic accuracy and decision-making for individual patients while also contributing to enhanced public health monitoring, disease surveillance, and policy-making through more efficient analysis of large datasets. MVLMs integrate natural language processing with medical images to enable a more comprehensive and contextual understanding of medical images alongside their corresponding textual information. Unlike general vision-and-language models trained on diverse, non-specialized datasets, MVLMs are purpose-built for the medical domain, automatically extracting and interpreting critical information from medical images and textual reports to support clinical decision-making. Popular clinical applications of MVLMs include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We conduct a detailed analysis of various vision-and-language model architectures, focusing on their distinct strategies for cross-modal integration and exploitation of medical visual and textual features. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics. Furthermore, we highlight potential challenges and summarize future research trends and directions. The full collection of papers and code is available at: https://github.com/YtongXie/Medical-Vision-and-Language-Tasks-and-Methodologies-A-Survey.

Paper Structure

This paper contains 66 sections, 2 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: This diagram illustrates the components of MVLMs, including data inputs like medical images, reports, and graphs, along with preprocessing operations such as data fusion and enhancement. MVLMs leverage various model architectures (e.g., transformer and generative networks) for tasks like medical report generation (MRG), visual question answering (VQA), medical image segmentation (MIS), and image-text retrieval (ITR). Applications span diagnosis, surgery planning, and early disease detection, enhancing clinical decision-making and workflows.
  • Figure 2: Statistics on the number of papers published in several top journals and conferences, including IEEE-TMI, Medical Image Analysis, CVPR, ICCV, and MICCAI. The plot shows consistent growth in recent literature.
  • Figure 3: Typical framework of medical report generation.
  • Figure 4: Four common ways of report generation: (a) Using an autoregressive architecture (based on LSTM, Transformer, etc.) that generates the report word by word; (b) Using hierarchical decoders that contain a sentence decoder and a word decoder; (c) Referring to sentences in a template database or filling template sentences with specific information; (d) Generating reports by relying on the comprehension and generation capabilities of large language models (LLMs).
  • Figure 5: Methods to leverage domain knowledge in medical report generation: (a) Domain knowledge assists in cross-modal feature fusion during feature extraction; (b) Domain knowledge supports the text generation process directly.
  • ...and 7 more figures