Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Qiuhui Chen, Huping Ye, Yi Hong

TL;DR

Med3DInsight is a pre-training framework that pairs existing 3D image encoders with 2D MLLMs, bridging them through a Plane-Slice-Aware Transformer (PSAT) module, and improves 3D medical image understanding by a good margin.

Abstract

Understanding 3D medical image volumes is a critical task in the medical domain. However, existing 3D convolution and transformer-based methods have limited semantic understanding of an image volume and need a large set of volumes for training. Recent advances in multi-modal large language models (MLLMs) provide a new and promising way to understand images with the help of text descriptions, but most current MLLMs are designed for 2D natural images. To enhance 3D medical image understanding with 2D MLLMs, we propose a novel pre-training framework called Med3DInsight, which pairs existing 3D image encoders with 2D MLLMs and bridges them via a Plane-Slice-Aware Transformer (PSAT) module. Extensive experiments demonstrate state-of-the-art performance on two downstream tasks, segmentation and classification, across three public datasets covering CT and MRI modalities and in comparison with more than ten baselines. Med3DInsight can be easily integrated into any current 3D medical image understanding network and improves its performance by a good margin.

Paper Structure (6 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of our Med3DInsight framework. During pre-training (Top), we train a 3D image encoder and a Plane-Slice-Aware Transformer (PSAT) that aligns 3D image features with 2D slice and text features extracted from 2D MLLMs. The pre-trained 3D encoder is then fine-tuned on the downstream tasks, including 3D segmentation (Bottom Left) and 3D classification (Bottom Right).
  • Figure 2: Visual comparison of 3D segmentation results predicted by multiple methods and enhanced by our Med3DInsight.
  • Figure 3: Ablation study of PSAT. QTrans denotes the Query Transformer, and PSPE denotes the Plane-Slice Position Embedding.
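
The captions above refer to 2D slices of a 3D volume and to a plane-slice position, but this summary does not say how either is computed. As a rough illustration only, the sketch below shows one way a volume could be cut into 2D slices along three orthogonal planes, with each slice tagged by a (plane, slice index) pair that a position embedding could later consume. The function names, the `volume[z][y][x]` layout, and the mapping of axes to "axial", "coronal", and "sagittal" are all assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical layout: volume[z][y][x], stored as nested Python lists.
Volume = List[List[List[float]]]
Slice2D = List[List[float]]

# Assumed plane names and order; the paper's convention may differ.
PLANES = ("axial", "coronal", "sagittal")


def take_slice(volume: Volume, plane: str, index: int) -> Slice2D:
    """Return one 2D slice of the volume along the given plane."""
    depth, height = len(volume), len(volume[0])
    if plane == "axial":      # fix z, result has shape (height, width)
        return [row[:] for row in volume[index]]
    if plane == "coronal":    # fix y, result has shape (depth, width)
        return [volume[z][index][:] for z in range(depth)]
    if plane == "sagittal":   # fix x, result has shape (depth, height)
        return [[volume[z][y][index] for y in range(height)]
                for z in range(depth)]
    raise ValueError(f"unknown plane: {plane}")


def tag_plane_slices(volume: Volume) -> List[Tuple[int, int, Slice2D]]:
    """List every slice as (plane_id, slice_index, slice)."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    counts = {"axial": depth, "coronal": height, "sagittal": width}
    tagged = []
    for plane_id, plane in enumerate(PLANES):
        for index in range(counts[plane]):
            tagged.append((plane_id, index, take_slice(volume, plane, index)))
    return tagged


# Tiny 2 x 3 x 4 volume whose voxel value encodes its own coordinates.
volume = [[[100 * z + 10 * y + x for x in range(4)]
           for y in range(3)] for z in range(2)]
tagged = tag_plane_slices(volume)
```

In a setup like this, the `(plane_id, slice_index)` pair is the only information a downstream module would need to tell slices apart by orientation and depth. How PSAT actually encodes that information is described in the paper itself, not here.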