ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi

TL;DR

This work presents ChestGPT, a unified vision-language framework that integrates the EVA ViT visual encoder with the LLaMA 2 LLM to perform both disease classification and localization in chest X-rays. A frozen EVA backbone extracts features from image patches, linear layers project these features into tokens, and a fine-tuned LLaMA 2 receives the tokens together with engineered prompts to generate text describing global diseases and bounding boxes for local abnormalities. On VinDr-CXR, ChestGPT achieves a global disease classification F1 of approximately $0.76$ and demonstrates the ability to localize pathologies, with performance influenced by prompt style and initialization. The approach aims to reduce radiologist workload by providing preliminary findings and spatial context. Directions identified for future work include more capable vision encoders, broader datasets, external validation, and improved localization metrics.

Abstract

The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.
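To make the reported metric concrete, the following is a minimal sketch of how an F1 score can be computed for image-level (global) disease labels when an image may carry more than one label. The abstract does not state how the score is aggregated, so the micro-averaging used here is an assumption. The function name `micro_f1` and the label strings are illustrative and are not taken from the paper.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over per-image label sets.

    Assumption: counts are pooled over all images and labels before
    precision and recall are computed. The paper may aggregate differently.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # labels predicted and present
        fp += len(pred - truth)   # labels predicted but absent
        fn += len(truth - pred)   # labels present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative label sets for three images (not real results).
truth = [{"Pneumonia"}, {"Tuberculosis", "Pneumonia"}, {"No finding"}]
preds = [{"Pneumonia"}, {"Tuberculosis"}, {"Pneumonia"}]
score = micro_f1(truth, preds)
```

In this toy case there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and the F1 score is 4/7.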

Paper Structure

This paper contains 15 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Block diagram of the proposed method, ChestGPT. The input X-ray image is divided into patches, and the EVA ViT extracts a feature vector from each patch. These feature vectors are concatenated and passed through linear layers that project them into a new space, forming tokens. The tokens, along with a prompt, are fed into LLaMA 2, which generates textual output containing the detected disease and the bounding boxes of the regions of interest associated with it. The pre-trained EVA ViT remains frozen during training, while the linear layers and LLaMA 2, which are also pre-trained, are fine-tuned.
  • Figure 2: Example chest X-ray images from the VinDr-CXR dataset (Nguyen et al., 2022), showing the detected diseases and the corresponding regions of interest within the images.
  • Figure 3: Qualitative comparison between test labels and model output for other diseases.
  • Figure 4: Qualitative comparison between test labels and model output for multiple abnormalities.
  • Figure 5: Qualitative comparison between test labels and model output showing multiple hallucinations.
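
The token-forming step described in the Figure 1 caption (concatenate patch feature vectors, project them linearly, then pass them to the language model with a prompt) can be sketched as follows. This is a minimal illustration in plain Python lists, not the authors' implementation. The group size, the dimensions, the single linear layer, the ordering of visual tokens before prompt embeddings, and all function names are assumptions made for the example.

```python
def group_patches(patch_features, group):
    """Concatenate every `group` consecutive patch feature vectors into one."""
    if len(patch_features) % group != 0:
        raise ValueError("number of patches must be divisible by the group size")
    return [
        [x for patch in patch_features[i:i + group] for x in patch]
        for i in range(0, len(patch_features), group)
    ]


def linear(vector, weight, bias):
    """Apply one linear layer; `weight` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def build_llm_input(patch_features, weight, bias, prompt_embeddings, group):
    """Project grouped patch features into tokens and prepend them to the prompt.

    Assumption: visual tokens come first in the sequence; the real model may
    interleave them with the prompt differently.
    """
    visual_tokens = [linear(v, weight, bias)
                     for v in group_patches(patch_features, group)]
    return visual_tokens + prompt_embeddings


# Toy sizes: 4 patches of width 2, grouped in pairs, projected to width 3.
patches = [[1, 2], [3, 4], [5, 6], [7, 8]]
weight = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
bias = [0, 0.5, -1]
prompt = [[0, 0, 0]]  # stand-in for one embedded prompt token
sequence = build_llm_input(patches, weight, bias, prompt, group=2)
```

Grouping patch features before projection shortens the sequence the language model has to process: here four patch vectors become two visual tokens. In the setup the caption describes, the weights of the projection would be learned during fine-tuning while the vision encoder that produces `patches` stays frozen.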