S-E Pipeline: A Vision Transformer (ViT) based Resilient Classification Pipeline for Medical Imaging Against Adversarial Attacks
Neha A S, Vivek Chaturvedi, Muhammad Shafique
TL;DR
This paper addresses the adversarial vulnerability of Vision Transformers (ViTs) in medical image classification by introducing the Segmentation-Enhancement (S-E) Pipeline. The approach combines region-of-interest (ROI) segmentation via a U-Net with image enhancement techniques (CLAHE, Unsharp Masking, and High-Frequency Emphasis filtering) as a preprocessing stage before ViT classification, and evaluates robustness against FGSM and PGD attacks. Empirical results show substantial reductions in attack impact: up to 72.22% on ViT-b32 and 86.58% on ViT-l32 for FGSM, and up to 36.25% and 80.26%, respectively, for PGD. The pipeline is additionally validated on CNNs and deployed on the NVIDIA Jetson Orin Nano. The work demonstrates a practical, edge-device-friendly defense for medical imaging, enabling more reliable automated diagnosis in resource-constrained environments.
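For readers unfamiliar with the two attacks named above, the following is a minimal sketch of FGSM and PGD. It is not the paper's evaluation code: a tiny logistic-regression classifier stands in for the ViT so that the input gradient can be written by hand, and every name and value (`w`, `b`, `eps`, `alpha`, `steps`) is a hypothetical chosen for illustration.

```python
import math

def predict(x, w, b):
    """Probability of class 1 under a toy logistic model (stand-in for a ViT)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def fgsm(x, y, w, b, eps):
    """FGSM: one step of size eps along the sign of the loss gradient."""
    g = input_gradient(x, y, w, b)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def pgd(x, y, w, b, eps, alpha, steps):
    """PGD: repeated small signed steps, projected back into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        g = input_gradient(x_adv, y, w, b)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project onto the L-infinity ball of radius eps, then onto the valid pixel range.
        x_adv = [clip(clip(xa, xi - eps, xi + eps), 0.0, 1.0)
                 for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical toy weights and a three-"pixel" input with true label 1.
w, b = [2.0, -1.0, 0.5], -0.5
x, y = [0.6, 0.2, 0.5], 1
x_fgsm = fgsm(x, y, w, b, eps=0.1)
x_pgd = pgd(x, y, w, b, eps=0.1, alpha=0.04, steps=5)
print(predict(x, w, b), predict(x_fgsm, w, b), predict(x_pgd, w, b))
```

Both attacks keep every pixel within `eps` of the original while lowering the model's confidence in the true label; a preprocessing defense such as the S-E Pipeline aims to make the classifier's decision less sensitive to perturbations of this kind.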
Abstract
Vision Transformers (ViTs) are becoming widely popular for automating accurate disease diagnosis in medical imaging, owing to their robust self-attention mechanism. However, ViTs remain vulnerable to adversarial attacks that may thwart the diagnosis process by deliberately inducing misclassification of critical diseases. In this paper, we propose a novel image classification pipeline, the S-E Pipeline, which applies multiple preprocessing steps so that the ViT is trained on critical features, reducing the impact of adversarial input perturbations. Our method combines segmentation with image enhancement techniques, namely Contrast Limited Adaptive Histogram Equalization (CLAHE), Unsharp Masking (UM), and High-Frequency Emphasis filtering (HFE), as preprocessing steps to identify critical features that remain intact even after adversarial perturbation. Our experimental study demonstrates that the pipeline reduces the effect of adversarial attacks by 72.22% for the ViT-b32 model and 86.58% for the ViT-l32 model. Furthermore, we demonstrate an end-to-end deployment of the proposed method on the NVIDIA Jetson Orin Nano board, showing its practical use in modern hand-held devices, which are usually resource-constrained.
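As an illustration of one of the enhancement steps named above, the following is a minimal sketch of Unsharp Masking on a grayscale image stored as nested lists with values in [0, 1]. It is an assumption-laden toy, not the paper's implementation: a box blur replaces the Gaussian blur typically used, and the `amount` and `radius` parameters are hypothetical.

```python
def box_blur(img, radius=1):
    """Mean filter over a (2*radius+1)^2 window, truncated at the image borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - radius), min(h, r + radius + 1))
                      for cc in range(max(0, c - radius), min(w, c + radius + 1))]
            out[r][c] = sum(window) / len(window)
    return out

def unsharp_mask(img, amount=1.0, radius=1):
    """Sharpen by adding back the difference between the image and its blur."""
    blurred = box_blur(img, radius)
    return [[min(1.0, max(0.0, p + amount * (p - q)))
             for p, q in zip(row, blurred_row)]
            for row, blurred_row in zip(img, blurred)]

# A vertical step edge: dark on the left, bright on the right.
edge = [[0.2, 0.2, 0.8, 0.8] for _ in range(4)]
sharpened = unsharp_mask(edge)
print(sharpened[1])
```

Flat regions pass through unchanged, while the contrast across the edge is exaggerated, which is the behavior that makes structural features stand out before classification.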
