Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks

Zihao Jin, Yingying Fang, Jiahao Huang, Caiwen Xu, Simon Walsh, Guang Yang

TL;DR

Diff3Dformer addresses 3D CT classification on small medical image datasets by leveraging diffusion-based slice representations to form informative slice sequences. It combines a DDIM-based diffusion autoencoder with a Clustering ViT that uses prototype-based clustering attention, reducing attention complexity from $O(N^2)$ to $O(NK)$ and enabling robust 3D learning with limited data. An interpretable slice-sequence fusion module computes the patient score $R$ via $R=\sum_{k=1}^{K} A_k \overline{r}_k q_k$, linking predictions to specific clusters for explainability. Empirical results on CC-CCII and FLD show that Diff3Dformer outperforms both CNN-based and Transformer-based baselines across small datasets, highlighting its potential for real-world clinical deployment with limited data.
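
To make the complexity claim concrete, here is a minimal PyTorch sketch of prototype-based clustering attention: the $N$ slice tokens exchange information through $K$ learned prototypes instead of attending to each other directly, so no $N \times N$ score matrix is ever formed. This is a generic illustration under assumed dimensions and a two-stage write/read scheme, not the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusteringAttention(nn.Module):
    """Prototype-based attention with O(N*K) scores -- a minimal sketch,
    not the exact Diff3Dformer layer. K learned prototypes first summarize
    the N tokens, then the tokens read from the updated prototypes."""

    def __init__(self, dim: int = 512, num_prototypes: int = 64):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        # For brevity the same projections serve both stages.
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) -- the sequence of slice representations.
        b = x.shape[0]
        p = self.prototypes.unsqueeze(0).expand(b, -1, -1)      # (b, K, dim)

        # Stage 1: prototypes aggregate the tokens (K x N scores).
        agg = F.softmax(self.to_q(p) @ self.to_k(x).transpose(-2, -1)
                        * self.scale, dim=-1) @ self.to_v(x)     # (b, K, dim)

        # Stage 2: tokens read from the aggregated prototypes (N x K scores).
        out = F.softmax(self.to_q(x) @ self.to_k(agg).transpose(-2, -1)
                        * self.scale, dim=-1) @ self.to_v(agg)   # (b, N, dim)
        return out
```

With $K \ll N$, both stages cost $O(NK)$, which is the reduction from the $O(N^2)$ of full self-attention referred to above.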

Abstract

The manifestation of symptoms associated with lung diseases can vary in depth across individual patients, highlighting the significance of 3D information in CT scans for medical image classification. While Vision Transformers have shown superior performance over convolutional neural networks in image classification tasks, their effectiveness is often demonstrated only on sufficiently large 2D datasets, and they easily encounter overfitting on small medical image datasets. To address this limitation, we propose a Diffusion-based 3D Vision Transformer (Diff3Dformer), which utilizes the latent space of the Diffusion model to form the slice sequence for 3D analysis and incorporates clustering attention into ViT to aggregate repetitive information within 3D CT scans, thereby harnessing the power of the advanced transformer in 3D classification tasks on small datasets. Our method exhibits improved performance on two different scales of small datasets of 3D lung CT scans, surpassing state-of-the-art 3D methods and other transformer-based approaches that emerged during the COVID-19 pandemic, demonstrating its robust and superior performance across different scales of data. Experimental results underscore the superiority of our proposed method, indicating its potential for enhancing medical image classification tasks in real-world scenarios.
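
The slice-sequence construction can be summarized in a short sketch: a pretrained diffusion-autoencoder encoder maps each 2D slice to a 512-dimensional semantic code, and the codes are stacked into the sequence consumed by the clustering ViT. The `SemanticEncoder` below is a placeholder stand-in (the paper uses a DDIM-based diffusion autoencoder, whose architecture is not reproduced here); all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Placeholder for the pretrained diffusion-autoencoder encoder.

    In Diff3Dformer this would be the semantic encoder of a DDIM-based
    diffusion autoencoder trained to reconstruct each 2D slice from a
    512-dimensional code; this small conv stack is only a stand-in.
    """

    def __init__(self, code_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim),
        )

    def forward(self, slices: torch.Tensor) -> torch.Tensor:
        return self.net(slices)            # (num_slices, code_dim)

@torch.no_grad()
def volume_to_sequence(ct_volume: torch.Tensor,
                       encoder: SemanticEncoder) -> torch.Tensor:
    """Encode a CT volume (depth, H, W) slice by slice into a
    (depth, 512) sequence -- the input of the clustering ViT."""
    slices = ct_volume.unsqueeze(1)        # (depth, 1, H, W): one channel
    return encoder(slices)

# Usage: a 64-slice volume becomes a sequence of 64 codes.
codes = volume_to_sequence(torch.randn(64, 256, 256), SemanticEncoder())
print(codes.shape)                         # torch.Size([64, 512])
```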

Paper Structure

This paper contains 17 sections, 2 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: (A) The overview framework of Diff3Dformer. (B) The diffusion autoencoder is leveraged to learn a semantically meaningful representation by learning to reconstruct the 2D slice from a 512-dimensional representation and being used to represent CT volumes as a sequence of representations as the input of the clustering ViT model. (C) The slice fusion module provides final patient decisions and explanations of Diff3Dformer.
  • Figure 2: Comparison of different methods on the CC-CCII and FLD datasets.
  • Figure 3: Heatmap of each cluster's contribution to the final patient-level risk score $R$ on the FLD dataset. Patients are ordered from highest to lowest risk score $R$ along the horizontal axis (left to right), with the 64 clusters on the vertical axis (see the sketch after this list).
  • Figure 4: Cluster ranking by contribution to the 'mortality in one year' class on the FLD dataset.
  • Figure 5: Visualization of the representative slices of high-risk clusters on the FLD dataset. The representative slices are those closest to each cluster's centroid.
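
The per-cluster contributions visualized in Figures 3 and 4 follow directly from the fusion rule $R=\sum_{k=1}^{K} A_k \overline{r}_k q_k$. The sketch below computes $R$ and the $K$ per-cluster terms; reading $A_k$ as a learned cluster attention weight, $\overline{r}_k$ as the mean slice-level risk within cluster $k$, and $q_k$ as the fraction of slices assigned to cluster $k$ is an assumption for illustration, not necessarily the paper's exact definition.

```python
import torch

def patient_score(slice_risks: torch.Tensor,
                  assignments: torch.Tensor,
                  cluster_attention: torch.Tensor,
                  num_clusters: int = 64):
    """Interpretable slice-sequence fusion: R = sum_k A_k * rbar_k * q_k.

    Assumed inputs (illustrative, not the paper's exact definitions):
      slice_risks       (N,)  per-slice risk scores r_i
      assignments       (N,)  cluster index of each slice
      cluster_attention (K,)  learned attention weights A_k
    Returns the patient score R and the K per-cluster contributions,
    i.e. the quantities shown in the Figure 3 heatmap.
    """
    n = slice_risks.numel()
    contributions = torch.zeros(num_clusters)
    for k in range(num_clusters):
        mask = assignments == k
        if mask.any():
            r_bar = slice_risks[mask].mean()    # mean risk within cluster k
            q = mask.float().sum() / n          # share of slices in cluster k
            contributions[k] = cluster_attention[k] * r_bar * q
    return contributions.sum(), contributions   # R and A_k * rbar_k * q_k

# Toy usage: 64 slices assigned among 64 clusters.
risks = torch.rand(64)
assign = torch.randint(0, 64, (64,))
attn = torch.softmax(torch.randn(64), dim=0)
R, contrib = patient_score(risks, assign, attn)
```

Because each contribution $A_k \overline{r}_k q_k$ is a single scalar per cluster, ranking clusters by this term (as in Figure 4) directly attributes the patient-level prediction to specific groups of slices.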