Faster Inference of Integer SWIN Transformer by Removing the GELU Activation
Mohammadreza Tayaranian, Seyyed Hasan Mozafari, James J. Clark, Brett Meyer, Warren Gross
TL;DR
The paper tackles the high inference latency of the SWIN Transformer, which is caused by windowed attention and by the GELU non-linearity, the latter of which complicates quantization. It introduces a GELU-less SWIN by replacing GELU with ReLU, removing the GELU fused operation and bias, and applying iterative knowledge distillation before post-training int8 quantization. The main contribution is a hardware-aware quantization workflow that reduces latency by at least 11% on an RTX 4090 across SWIN configurations while keeping the ImageNet top-1 accuracy loss under 0.5%. The approach delivers practical speedups for vision transformers on real hardware by combining a simpler non-linear activation with distillation-based accuracy recovery. These results suggest a viable path for deploying fast, quantized vision transformers in latency-constrained applications.
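To make the distillation step concrete, the following is a minimal sketch of a soft-target distillation loss, in which a student (here, the ReLU model) is trained to match the output distribution of a teacher (here, the original GELU model). The summary above does not specify the loss, so the temperature-scaled KL divergence, the function names `softmax` and `distillation_loss`, and the default temperature are all assumptions chosen for illustration, not the authors' formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumed formulation (standard soft-target distillation); the
    temperature**2 factor keeps gradient magnitudes comparable
    across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2

# Hypothetical logits for a 3-class problem.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two distributions diverge, which is what lets the teacher's outputs guide accuracy recovery after the activation swap.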
Abstract
The SWIN transformer is a prominent vision transformer model with state-of-the-art accuracy in image classification tasks. Despite this success, its unique architecture causes slower inference compared with similar deep neural networks. Integer quantization is one of the methods used to improve its inference latency. However, state-of-the-art methods have not been able to fully quantize the model. In this work, we improve upon the inference latency of the state-of-the-art methods by removing the floating-point operations associated with the GELU activation in the SWIN transformer. While previous work proposed replacing the non-integer operations with linear approximation functions, we propose replacing GELU with the ReLU activation. The advantage of ReLU over previous methods is its low memory and computational complexity. We use iterative knowledge distillation to compensate for the accuracy lost by replacing GELU with ReLU. We quantize our GELU-less SWIN transformer and show that, on an NVIDIA RTX 4090 GPU, we can improve the inference latency of the quantized SWIN transformer by at least $11\%$ while keeping the accuracy drop under $0.5\%$ on the ImageNet evaluation dataset.
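The reason ReLU suits integer inference can be illustrated with a small sketch. GELU in its exact form, $x \cdot \Phi(x)$, needs the Gaussian CDF and therefore floating-point math, whereas under affine int8 quantization (real value $\approx$ `scale * (q - zero_point)` with a positive `scale`) ReLU reduces to an integer `max` against the zero point. The quantization scheme, the helper names, and the parameter values below are assumptions for illustration; they are not taken from the paper's implementation.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), which needs erf and floating-point math."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    """Floating-point ReLU, used here only as the reference."""
    return max(x, 0.0)

def quantize(x, scale, zero_point):
    """Affine int8 quantization (assumed scheme), clamped to [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """Map an int8 code back to its real value."""
    return scale * (q - zero_point)

def relu_int8(q, zero_point):
    """ReLU applied directly to int8 codes: a single integer max.

    Valid because scale > 0, so the real value is negative exactly
    when q < zero_point.
    """
    return max(q, zero_point)

# Hypothetical quantization parameters.
scale, zero_point = 0.05, 3
inputs = [-2.0, -0.3, 0.0, 0.4, 1.7]
codes = [quantize(x, scale, zero_point) for x in inputs]
activated = [relu_int8(q, zero_point) for q in codes]
```

Dequantizing `activated` gives the same values as applying floating-point ReLU to the dequantized inputs, so the activation never has to leave the integer domain. No comparably simple integer identity exists for `gelu`.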
