Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification

Massoud Dehghan, Ramona Woitek, Amirreza Mahbod

TL;DR

This study evaluates how different patch sizes affect ViT classification performance and shows that a straightforward ensemble strategy, which fuses the predictions of models trained with patch sizes 1, 2, and 4, further boosts performance in most cases.

Abstract

Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on architectural improvements, the impact of patch size, a crucial initial design choice in ViTs, remains underexplored, particularly in medical domains where both two-dimensional (2D) and three-dimensional (3D) imaging modalities exist. In this study, using 12 medical imaging datasets from various imaging modalities (seven 2D and five 3D datasets), we conduct a thorough evaluation of how different patch sizes affect ViT classification performance. Using a single graphics processing unit (GPU) and a range of patch sizes (1, 2, 4, 7, 14, 28), we fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes (1, 2, and 4), which achieve the best results across nearly all datasets. More specifically, our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets (patch size 2 vs. 28) and up to 23.78% for 3D datasets (patch size 1 vs. 14), at the cost of increased computational expense. Moreover, by applying a straightforward ensemble strategy that fuses the predictions of the models trained with patch sizes 1, 2, and 4, we demonstrate a further boost in performance in most cases, especially for the 2D datasets. Our implementation is publicly available on GitHub: https://github.com/HealMaDe/MedViT
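
To make the ensemble idea concrete, the following is a minimal sketch of one common way to fuse predictions: averaging the per-class probabilities of several models (soft voting) and scoring the result with balanced accuracy, that is, the mean of per-class recall. The abstract does not specify the fusion rule, so the averaging step is an assumption, and the function names and probability values are hypothetical toy data, not results from the paper.

```python
from typing import List, Sequence

Probs = List[List[float]]  # one row of class probabilities per sample


def fuse_predictions(prob_sets: Sequence[Probs]) -> Probs:
    """Average class probabilities across models (soft voting; an assumed fusion rule)."""
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    fused = []
    for i in range(n_samples):
        n_classes = len(prob_sets[0][i])
        fused.append(
            [sum(model[i][c] for model in prob_sets) / n_models for c in range(n_classes)]
        )
    return fused


def argmax(row: Sequence[float]) -> int:
    """Index of the largest entry in a probability row."""
    return max(range(len(row)), key=row.__getitem__)


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy outputs of three models (patch sizes 1, 2, and 4)
# on four samples of a two-class problem. These numbers are made up.
y_true = [0, 0, 1, 1]
probs_p1 = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.6, 0.4]]
probs_p2 = [[0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
probs_p4 = [[0.6, 0.4], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6]]

fused = fuse_predictions([probs_p1, probs_p2, probs_p4])
fused_labels = [argmax(row) for row in fused]
p1_labels = [argmax(row) for row in probs_p1]

print("patch size 1 only:", balanced_accuracy(y_true, p1_labels))
print("ensemble (1, 2, 4):", balanced_accuracy(y_true, fused_labels))
```

In this toy case the averaged probabilities correct mistakes that individual models make on different samples, which is the usual motivation for this kind of ensemble. Other fusion rules, such as majority voting or weighted averaging, would fit the abstract's description equally well.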

Paper Structure (12 sections, 6 figures, 13 tables)

Figures (6)

  • Figure 1: Overview of the 12 selected datasets from the MedMNIST V2 collection used in this study. Each sub-figure shows some representative example images along with the dataset name, imaging modality, and classification task. For the 3D datasets, the middle slices from random samples are shown.
  • Figure 2: Example of splitting the image into patches with different sizes for 2D models.
  • Figure 3: Example of splitting the image into patches with different sizes for 3D models.
  • Figure 4: Comparison of attention maps for patch sizes of 2 and 28 for some example images. The figure shows original images with patch grids (left three columns) and corresponding attention heatmaps (right two columns). Models with smaller patch sizes (P2) consistently achieve correct predictions with more focused attention patterns, while larger patch sizes (P28) lead to incorrect predictions with less informative attention distributions.
  • Figure 5: Effect of patch size on vision transformer classification performance across 12 medical image datasets (seven 2D and five 3D datasets from the MedMNIST V2 collection).
  • ...and 1 more figures
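
The patch splitting illustrated in Figures 2 and 3 can be sketched numerically. The snippet below counts how many patches (tokens) each patch size produces for a 2D image and a 3D volume. It assumes a side length of 28, the default MedMNIST V2 resolution and a value that all six patch sizes divide evenly; the summary above does not state the input size explicitly, and the function name is hypothetical.

```python
def num_patches(image_size: int, patch_size: int, dims: int) -> int:
    """Number of non-overlapping patches along `dims` equal-length axes."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size evenly")
    return (image_size // patch_size) ** dims


IMAGE_SIZE = 28  # assumed side length (MedMNIST V2 default), not stated in the summary
PATCH_SIZES = (1, 2, 4, 7, 14, 28)  # patch sizes listed in the abstract

# Map each patch size to its (2D token count, 3D token count).
token_counts = {
    p: (num_patches(IMAGE_SIZE, p, dims=2), num_patches(IMAGE_SIZE, p, dims=3))
    for p in PATCH_SIZES
}

for p, (n2d, n3d) in token_counts.items():
    print(f"patch size {p:>2}: {n2d:>4} tokens (2D), {n3d:>6} tokens (3D)")
```

Under this assumption, patch size 1 yields 784 tokens in 2D and 21,952 in 3D, while patch size 28 collapses the whole input into a single token. Because self-attention cost grows quadratically with the number of tokens, this is one way to see why smaller patches come with the increased computational expense noted in the abstract.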