E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model

Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Weifu Lv, S. Kevin Zhou

TL;DR

This work collects a large amount of unlabeled 3D CT data and utilizes self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features, and applies 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity while preserving spatial information.

Abstract

The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment. However, compared to 2D medical images, 3D medical images such as CT scans pose challenges of limited training data and high dimensionality, which severely restrict the progress of 3D medical vision-language models. To address these issues, we collect a large amount of unlabeled 3D CT data and use self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features. We then apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity while preserving spatial information. We also construct two instruction-tuning datasets based on BIMCV-R and CT-RATE to fine-tune the 3D vision-language model. Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis. Code and data will be made publicly available soon.
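
To illustrate why aggregating features with a strided 3D convolution reduces the number of visual tokens passed to the language model while keeping the 3D grid layout, the following is a minimal sketch of the shape arithmetic only. The volume size, patch size, kernel, and stride are hypothetical values chosen for illustration and are not taken from the paper; the helper names are likewise assumptions.

```python
from math import prod


def patch_grid(volume_shape, patch_size):
    """Patches per axis for a non-overlapping 3D patch embedding."""
    if any(v % p for v, p in zip(volume_shape, patch_size)):
        raise ValueError("each volume axis must be divisible by the patch size")
    return tuple(v // p for v, p in zip(volume_shape, patch_size))


def conv3d_output_grid(grid, kernel, stride, padding=0):
    """Per-axis output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return tuple((n + 2 * padding - kernel) // stride + 1 for n in grid)


# Hypothetical CT crop (depth, height, width) and patch size; not from the paper.
volume = (32, 256, 256)
patch = (4, 16, 16)

grid = patch_grid(volume, patch)                        # token grid from the encoder
reduced = conv3d_output_grid(grid, kernel=2, stride=2)  # grid after 3D aggregation

tokens_before = prod(grid)
tokens_after = prod(reduced)

print("encoder grid:", grid, "tokens:", tokens_before)
print("aggregated grid:", reduced, "tokens:", tokens_after)
print("reduction factor:", tokens_before // tokens_after)
```

Under these assumed sizes, a stride of 2 along each axis shrinks the token count by a factor of 8, and the output is still a 3D grid, so neighboring tokens still correspond to neighboring regions of the scan.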

Paper Structure

This paper contains 21 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The pipeline of E3D-GPT. (a) The 3D image encoder is pre-trained with a 3D MAE. (b) In the E3D-GPT model, 3D medical images are fed into the pre-trained 3D image encoder and an efficient 3D convolution to produce refined embeddings that are inserted into the LLM.
  • Figure 2: Qualitative comparisons with different models and ground truth on report generation. Matching colors in both the predictions and the answers indicate corresponding content.
  • Figure 3: Qualitative comparisons with different models and ground truth on VQA. Matching colors in both the predictions and the answers indicate corresponding content.
  • Figure 4: Case study on out-of-distribution (OOD) questions. We test E3D-GPT trained on CT-RATE-VQA on OOD dialogue, meaning that none of the questions are related to our training data.