Advanced Chest X-Ray Analysis via Transformer-Based Image Descriptors and Cross-Model Attention Mechanism

Lakshita Agarwal, Bindu Verma

TL;DR

The paper addresses automatic generation of chest X-ray descriptions by integrating a Vision Transformer-based encoder with cross-modal attention and a GPT-4 decoder. The CrossViT-GPT4 architecture grounds visual features in language, achieving state-of-the-art results on the IU and NIH chest X-ray datasets across BLEU, CIDEr, METEOR, and ROUGE-L metrics. By combining robust visual representations with a powerful language model, it improves clinical relevance and description quality while highlighting challenges such as data scarcity and image quality. The approach has the potential to streamline radiology reporting and support clinical decision-making.

Abstract

The examination of chest X-ray images is a crucial component in detecting various thoracic illnesses. This study introduces a new image description generation model that integrates a Vision Transformer (ViT) encoder with cross-modal attention and a GPT-4-based transformer decoder. The ViT captures high-quality visual features from chest X-rays, which are fused with text data through cross-modal attention to improve the accuracy, context, and richness of image descriptions. The GPT-4 decoder transforms these fused features into accurate and relevant captions. The model was tested on the National Institutes of Health (NIH) and Indiana University (IU) Chest X-ray datasets. On the IU dataset, it achieved scores of 0.854 (BLEU-1), 0.883 (CIDEr), 0.759 (METEOR), and 0.712 (ROUGE-L). On the NIH dataset, it achieved the best performance on all metrics: BLEU-1 through BLEU-4 (0.825, 0.788, 0.765, 0.752), CIDEr (0.857), METEOR (0.726), and ROUGE-L (0.705). This framework has the potential to enhance chest X-ray evaluation, assisting radiologists in more precise and efficient diagnosis.
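
The fusion step described in the abstract can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' implementation: the direction of attention (text tokens as queries, image patches as keys and values), the random projection matrices, and all dimensions below are assumptions chosen only to show how each text token can gather a weighted mix of visual features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(text_tokens, image_patches, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    Assumption: queries come from the text tokens, keys and values
    from the image patch embeddings.
    """
    q = text_tokens @ w_q        # (T, d)
    k = image_patches @ w_k      # (P, d)
    v = image_patches @ w_v      # (P, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T, P) similarity scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    fused = weights @ v                      # (T, d) text-conditioned visual features
    return fused, weights


# Hypothetical sizes: 5 text tokens, 196 image patches (a 14 x 14 grid).
rng = np.random.default_rng(0)
T, P, d_text, d_img, d = 5, 196, 32, 48, 16
text_tokens = rng.normal(size=(T, d_text))
image_patches = rng.normal(size=(P, d_img))
w_q = rng.normal(size=(d_text, d)) * 0.1
w_k = rng.normal(size=(d_img, d)) * 0.1
w_v = rng.normal(size=(d_img, d)) * 0.1

fused, weights = cross_attention(text_tokens, image_patches, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In a trained model the projection matrices would be learned, and several such heads would typically run in parallel before the fused features reach the decoder.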

Paper Structure

This paper contains 13 sections, 11 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Structure of the proposed CrossViT-GPT4 framework. In the initial stage, the ViT encoder uses cross-model attention to extract high-level visual features from the pre-processed images. The GPT-4 decoder tokenizes the provided input caption file into individual words, then uses the dense layers of the network to produce image descriptions.
  • Figure 2: Cross-Model Attention Mechanism.