ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation

Siyou Li, Beining Xu, Yihao Luo, Dong Nie, Le Zhang

TL;DR

This work employs the 3D Vision Transformer (ViT3D) image encoder introduced in M3D-CLIP to process 3D scans and uses Asclepius-Llama3-8B as the language model to generate text reports by auto-regressive decoding, outperforming the baseline model.

Abstract

Automatic medical report generation (MRG), which aims to produce detailed text reports from medical images, has emerged as a critical task in this domain. MRG systems can enhance radiological workflows by reducing the time and effort required for report writing, thereby improving diagnostic efficiency. In this work, we present a novel approach for automatic MRG that uses a multimodal large language model. Specifically, we employ the 3D Vision Transformer (ViT3D) image encoder introduced in M3D-CLIP to process 3D scans and use Asclepius-Llama3-8B as the language model to generate text reports by auto-regressive decoding. Experiments show that our model achieves an average Green score of 0.3 on the MRG task validation set and an average accuracy of 0.61 on the visual question answering (VQA) task validation set, outperforming the baseline model. These results demonstrate the effectiveness of the ViT3D alignment of LLaMA3 for automatic MRG and VQA tasks, with the model tuned on a small dataset.

Paper Structure

This paper contains 8 sections, 2 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Example of Medical Report Generation (MRG) results from our model.
  • Figure 2: An overview of our ViT3D Alignment of LLaMA3 approach. The vision encoder embeds the raw image data into a feature map, which a ViT module then converts into a sequence of embeddings. These embeddings are concatenated with the prompt embeddings and passed to a Large Language Model, which generates the next token of the text being produced.
  • Figure 3: The curve of training loss for MRG (left) and VQA (right).
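
The data flow described in the Figure 2 caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. All function names, the 4x4x4 volume, the 2x2x2 patch size, the embedding width of 3, and the toy "language model" are assumptions chosen only to make the example runnable. In particular, a single linear projection per patch stands in for the ViT3D encoder (no attention layers), and a stub function stands in for Asclepius-Llama3-8B.

```python
import random


def patchify_3d(volume, patch):
    """Split a D x H x W nested-list volume into flattened, non-overlapping patches."""
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    pd, ph, pw = patch
    assert d % pd == 0 and h % ph == 0 and w % pw == 0, "volume must tile evenly"
    patches = []
    for z in range(0, d, pd):
        for y in range(0, h, ph):
            for x in range(0, w, pw):
                patches.append([
                    volume[z + dz][y + dy][x + dx]
                    for dz in range(pd) for dy in range(ph) for dx in range(pw)
                ])
    return patches


def linear(vec, weight):
    """Project one vector with a (len(vec) x out_dim) weight matrix."""
    out_dim = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(out_dim)]


def embed_volume(volume, patch, weight):
    """Stand-in for the vision encoder: one embedding per 3D patch."""
    return [linear(p, weight) for p in patchify_3d(volume, patch)]


def build_llm_input(image_embeds, prompt_embeds):
    """Concatenate image and prompt embeddings along the sequence axis."""
    return image_embeds + prompt_embeds


def greedy_decode(next_token_fn, inputs, embed_token, eos_id, max_new_tokens):
    """Auto-regressive loop: predict a token, append its embedding, repeat."""
    seq = list(inputs)
    generated = []
    for _ in range(max_new_tokens):
        token = next_token_fn(seq)
        if token == eos_id:
            break
        generated.append(token)
        seq.append(embed_token(token))
    return generated


# Toy data (all values are arbitrary assumptions).
rng = random.Random(0)
volume = [[[float(z * 16 + y * 4 + x) for x in range(4)] for y in range(4)] for z in range(4)]
weight = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]  # 8 voxels -> 3 dims

image_embeds = embed_volume(volume, (2, 2, 2), weight)  # 8 patches, each of width 3
prompt_embeds = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]      # 2 hypothetical prompt tokens
llm_input = build_llm_input(image_embeds, prompt_embeds)

# Stub model: emits 0, 1, 2, ... based on sequence length; token 3 acts as end-of-sequence.
generated = greedy_decode(
    next_token_fn=lambda seq: len(seq) - len(llm_input),
    inputs=llm_input,
    embed_token=lambda tok: [float(tok)] * 3,
    eos_id=3,
    max_new_tokens=20,
)
print(len(image_embeds), len(llm_input))  # 8 10
print(generated)  # [0, 1, 2]
```

The sketch only shows the shape of the pipeline: a 3D volume becomes a sequence of patch embeddings, that sequence is prepended to the prompt embeddings, and the language model extends the combined sequence one token at a time until it emits an end-of-sequence token.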