Vision Transformer for Classification of Breast Ultrasound Images

Behnaz Gheflati, Hassan Rivaz

TL;DR

Breast ultrasound image classification is constrained by scarce labeled data and by the limited global context that CNNs capture. This study assesses Vision Transformer (ViT) architectures, applying transfer learning from pre-trained ViTs and comparing them with state-of-the-art CNNs on BUSI and dataset B, using accuracy and AUC as evaluation metrics. ViT models achieve over 85% accuracy and ~0.95 AUC, with the small ViT-B/32 often performing best, and match or exceed the CNN baselines in several cases. The results suggest that ViTs can learn global spatial dependencies in US images and offer a viable alternative to CNNs for medical ultrasound classification, particularly when data are limited.

Abstract

Medical ultrasound (US) imaging has become a prominent modality for breast cancer imaging due to its ease of use, low cost, and safety. In the past decade, convolutional neural networks (CNNs) have emerged as the method of choice in vision applications and have shown excellent potential in automatic classification of US images. Despite their success, their restricted local receptive field limits their ability to learn global context information. Recently, Vision Transformer (ViT) designs, which are based on self-attention between image patches, have shown great potential as an alternative to CNNs. In this study, for the first time, we utilize ViT to classify breast US images using different augmentation strategies. The results are reported as classification accuracy and Area Under the Curve (AUC) metrics, and the performance is compared with state-of-the-art CNNs. The results indicate that the ViT models achieve efficiency comparable to, or even better than, that of the CNNs in classification of breast US images.
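As a rough illustration of the two metrics named in the abstract, the sketch below computes accuracy and AUC in plain Python. It is not the authors' evaluation code: the function names `accuracy` and `binary_auc`, the toy labels and scores, and the 0.5 decision threshold are all hypothetical, and the binary setup is an assumption made only to keep the example short.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_auc(y_true, scores):
    """ROC AUC for binary labels (1 = positive, 0 = negative).

    Computed as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative sample,
    with ties counted as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and classifier scores for the positive class.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
# Assumed decision threshold of 0.5 to turn scores into hard predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(labels, preds))
print("AUC:", binary_auc(labels, scores))
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores rank positives against negatives, which is one reason the two are often reported together.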

Paper Structure

This paper contains 17 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Example of breast US images with three different classifications.
  • Figure 2: Overview of the Vision Transformer used for classification of breast US images.