VariViT: A Vision Transformer for Variable Image Sizes

Aswathi Varma; Suprosanna Shit; Chinmay Prabhakar; Daniel Scholz; Hongwei Bran Li; Bjoern Menze; Daniel Rueckert; Benedikt Wiestler

TL;DR

VariViT is a Vision Transformer (ViT) variant designed to handle variable image sizes while keeping a consistent patch size. It surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification, and its batching strategy reduces computation time by up to 30% compared to conventional architectures.

Abstract

Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi-Varma/varivit
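
The link between a fixed patch size and a variable number of patches can be made concrete with a small calculation. The sketch below is illustrative only: the function name `num_patches` and the example crop and patch sizes are hypothetical and are not taken from the paper.

```python
from math import prod


def num_patches(image_shape, patch_size):
    """Count non-overlapping patches for an image of arbitrary dimensionality.

    Assumes every image dimension is an exact multiple of the patch size
    along that axis (an assumption of this sketch).
    """
    if len(image_shape) != len(patch_size):
        raise ValueError("image_shape and patch_size must have the same length")
    if any(dim % p != 0 for dim, p in zip(image_shape, patch_size)):
        raise ValueError("each image dimension must be divisible by the patch size")
    return prod(dim // p for dim, p in zip(image_shape, patch_size))


# Hypothetical 3D crops with the same patch size but different extents.
patch = (16, 16, 16)
small_crop = num_patches((64, 64, 32), patch)   # 4 * 4 * 2 = 32 tokens
large_crop = num_patches((96, 96, 96), patch)   # 6 * 6 * 6 = 216 tokens
print(small_crop, large_crop)
```

Because self-attention compares every token with every other token, the smaller crop in this example needs far fewer pairwise comparisons than the same tumor padded or resized to the larger shape.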

Paper Structure (11 sections, 1 equation, 9 figures, 3 tables)

This paper contains 11 sections, 1 equation, 9 figures, 3 tables.

Introduction
Related Work
Method
Experimental Setup
Results
Discussion and Conclusion
Datasets
Bounding Box Distribution
Absolute v/s Relative Position Embedding
Cosine Similarity
t-SNE Plots

Figures (9)

Figure 1: Selecting the optimal input crop size is essential for maximizing the representation quality (i.e., attention level) of ViTs. (a): 2D slice of the large tumor bounding box crop. (b): Attention map of the vanilla ViT model on the image, showing attention to background data rather than the tumor. (c): The image resized to a large fixed size shifts focus to distortions arising from the operation. (d): (Ours) Smaller image crop without resizing. In our method, attention is mainly given to desired tumor regions.
Figure 2: The VariViT model addresses the problem of handling images with different sizes by introducing a novel positional embedding resizing mechanism and employing different batching strategies. The model utilizes a fixed patch size, ensuring consistent patch embedding sizes across images while simultaneously adapting to different sequence lengths using a center and select resizing strategy.
Figure 3: (a): Tumor bounding boxes of two sizes are displayed, with the tumor center of mass aligned. (b): Our proposed center and select method for resizing position embeddings initializes fixed sinusoidal embeddings for the largest image size. Embeddings for other sizes are selected from the center of this grid. (c): We present two batching strategies: Custom Batch Sampler (CBS) with the same image sizes and Gradient Accumulation (GA) with varying image sizes within a batch.
Figure 4: t-SNE visualization of embedding layer output for IDH status classification. (a): Vanilla ViT, (b): Pix2Struct and (c): VariViT-GA, all with an embedding dimension of 384. Notice the clearer separation of clusters in our model's plot.
Figure 5: Comparison of positional embedding strategies using the VariViT-GA model.
...and 4 more figures
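
The center-and-select resizing and the same-size batching described in the Figure 3 caption can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `center_select_indices` and `same_size_batches` are hypothetical, the floor rounding used when the margin is odd is an assumption, and the sinusoidal embedding table for the largest grid is not built here (only the row indices to select from it are computed).

```python
from collections import defaultdict
from itertools import product


def center_select_indices(max_grid, grid):
    """Flat (row-major) indices of the central `grid` block inside `max_grid`.

    `max_grid` is the patch grid of the largest image size; `grid` is the
    patch grid of the current image. Rows of a position-embedding table
    built for `max_grid` can be picked with the returned indices.
    """
    if len(max_grid) != len(grid):
        raise ValueError("grids must have the same number of axes")
    if any(g < 1 or g > m for g, m in zip(grid, max_grid)):
        raise ValueError("grid must fit inside max_grid")
    # Start offset per axis; floor division when the margin is odd (assumption).
    starts = [(m - g) // 2 for m, g in zip(max_grid, grid)]
    # Row-major strides of the full grid.
    strides, stride = [], 1
    for m in reversed(max_grid):
        strides.append(stride)
        stride *= m
    strides.reverse()
    ranges = [range(s, s + g) for s, g in zip(starts, grid)]
    return [sum(i * st for i, st in zip(idx, strides)) for idx in product(*ranges)]


def same_size_batches(shapes, batch_size):
    """Group sample indices so that every batch holds a single image size."""
    buckets = defaultdict(list)
    for index, shape in enumerate(shapes):
        buckets[tuple(shape)].append(index)
    batches = []
    for indices in buckets.values():
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    return batches


print(center_select_indices((4, 4), (2, 2)))  # [5, 6, 9, 10]
print(same_size_batches([(4, 4), (2, 2), (4, 4), (4, 4), (2, 2)], 2))
# [[0, 2], [3], [1, 4]]
```

Given a table `pos_embed` with one row per patch of the largest grid, the embeddings for a smaller image would be `[pos_embed[i] for i in center_select_indices(max_grid, grid)]`. Grouping by size keeps every tensor in a batch the same shape, so no padding is needed inside a batch. The Gradient Accumulation variant, which mixes sizes within a batch, is not covered by this sketch.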
