Vision Transformer Computation and Resilience for Dynamic Inference

Kavya Sreedhar, Jason Clemons, Rangharajan Venkatesan, Stephen W. Keckler, Mark Horowitz

TL;DR

This paper tackles dynamic, resource-constrained inference for vision transformers used in semantic segmentation and object detection. It shows that convolutions, not attention, account for most FLOPs in modern models, but that FLOP share is a poor predictor of GPU runtime because GPUs are heavily optimized for convolutions. It also demonstrates that dynamic execution paths, created by pruning and by switching between pretrained or retrained models, can substantially reduce energy and latency: 28% of energy for SegFormer with a 1.4% accuracy drop and no additional training, and 53% of energy for ResNet-50 with a 3.3% accuracy drop by switching between pretrained Once-For-All models. The authors profile both GPUs and MAGNet accelerators to identify viable alternative paths, and establish design principles for selecting and exploiting these paths, including prioritizing convolutional blocks and decoder pruning. Their results show that CNN-accelerator-friendly execution is essential for efficient dynamic vision-transformer inference, and that resilience to pruning varies across architectures: segmentation models often benefit from switching between retrained paths, while CNN backbones (OFA-ResNet-50) scale well by switching between pretrained models. Overall, the work provides practical guidance for deploying dynamic inference in real-time vision systems by combining hardware-aware profiling with architecture-aware pruning and model switching.

Abstract

State-of-the-art deep learning models for computer vision tasks are based on the transformer architecture and often deployed in real-time applications. In this scenario, the resources available for every inference can vary, so it is useful to be able to dynamically adapt execution to trade accuracy for efficiency. To create dynamic models, we leverage the resilience of vision transformers to pruning and switch between different scaled versions of a model. Surprisingly, we find that most FLOPs are generated by convolutions, not attention. These relative FLOP counts are not a good predictor of GPU performance since GPUs have special optimizations for convolutions. Some models are fairly resilient and their model execution can be adapted without retraining, while all models achieve better accuracy with retraining alternative execution paths. These insights mean that we can leverage CNN accelerators and these alternative execution paths to enable efficient and dynamic vision transformer inference. Our analysis shows that leveraging this type of dynamic execution can lead to saving 28% of energy with a 1.4% accuracy drop for SegFormer (63 GFLOPs), with no additional training, and 53% of energy for ResNet-50 (4 GFLOPs) with a 3.3% accuracy drop by switching between pretrained Once-For-All models.
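
The accuracy-for-efficiency trade described in the abstract can be pictured as a small selection problem: given a per-inference energy budget, pick the most accurate execution path that fits. The following is a minimal sketch under stated assumptions, not the authors' method. The names `ExecutionPath`, `select_path`, and `SEGFORMER_PATHS` are hypothetical, and the only numbers taken from the abstract are the 28% energy saving and 1.4% accuracy drop for the SegFormer path that needs no retraining.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExecutionPath:
    """One way of running a model (hypothetical structure for illustration)."""
    name: str
    relative_energy: float  # energy per inference relative to the full model (1.0)
    accuracy_drop: float    # accuracy lost versus the full model, in percent


def select_path(paths: Sequence[ExecutionPath], energy_budget: float) -> ExecutionPath:
    """Return the most accurate path whose energy fits the budget.

    If no path fits, fall back to the cheapest one so inference still runs.
    """
    feasible = [p for p in paths if p.relative_energy <= energy_budget]
    if not feasible:
        return min(paths, key=lambda p: p.relative_energy)
    # Prefer the smallest accuracy drop; break ties by lower energy.
    return min(feasible, key=lambda p: (p.accuracy_drop, p.relative_energy))


# Two paths for SegFormer, using the abstract's figures:
# a 28% energy saving corresponds to a relative energy of 0.72.
SEGFORMER_PATHS = [
    ExecutionPath("full", 1.00, 0.0),
    ExecutionPath("pruned_no_retraining", 0.72, 1.4),
]

if __name__ == "__main__":
    for budget in (1.0, 0.8):
        print(budget, select_path(SEGFORMER_PATHS, budget).name)
```

A real deployment would have more than two paths and would likely measure energy on the target hardware rather than rely on relative estimates.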

Paper Structure (20 sections, 13 figures, 3 tables)


Figures (13)

  • Figure 1: FLOPs and NVIDIA RTX A5000 GPU execution time in convolutions (dots) and ResNet-50 backbone (dashed lines) for inference with DETR [carion2020end] (orange), Conditional DETR [meng2021-CondDETR] (green), DAB DETR [liu2022dabdetr] (blue), and Anchor DETR [wang2022anchor] (yellow). For larger image sizes, convolutions dominate FLOPs but not GPU execution time.
  • Figure 2: Layers in the SegFormer [xie2021segformer] model. The Swin Transformer [liu2021swin] model follows the same high-level structure, with a more optimized attention module and the UPerNet decoder head [xiao2018unified]. The UPerNet decoder has a layer similar to SegFormer's Conv2DFuse, which we refer to as fpn_bottleneck_Conv2D.
  • Figure 3: FLOPs distribution across SegFormer ADE B2 model layers and Swin Tiny model layers for inference with a 512 by 512 input image size.
  • Figure 4: Image pixels versus NVIDIA RTX A5000 GPU execution time spent on convolutions for inference with the SegFormer ADE B2 (blue), SegFormer City B2 (orange), Swin Tiny (gray), Swin Small (yellow), and Swin Base models (green).
  • Figure 5: Parameterizable MAGNet [venkatesan2019magnet] accelerator and PE architecture templates.
  • ...and 8 more figures
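
The FLOP breakdowns in the figures above rest on per-layer operation counts. As a rough sketch of how such counts are commonly estimated, the snippet below uses the standard multiply-accumulate (MAC) formulas for a 2D convolution and for the attention score and weighted-sum steps. It is not the authors' profiling tool. The function names are hypothetical, and the layer shapes in the example are illustrative assumptions rather than values taken from the paper.

```python
def conv2d_macs(out_h: int, out_w: int, in_ch: int, out_ch: int,
                kernel_h: int, kernel_w: int, groups: int = 1) -> int:
    """MACs for a 2D convolution: one kernel application per output element."""
    return out_h * out_w * out_ch * (in_ch // groups) * kernel_h * kernel_w


def attention_macs(n_queries: int, n_keys: int, dim: int) -> int:
    """MACs for Q @ K^T plus the attention-weighted sum over V.

    Excludes the linear projections and the softmax.
    """
    return 2 * n_queries * n_keys * dim


if __name__ == "__main__":
    # Assumed shapes, chosen only to illustrate the comparison:
    # a 1x1 fusion convolution over a 128x128 feature map, and one attention
    # head with 128*128 queries attending to a reduced set of 256 keys.
    conv = conv2d_macs(out_h=128, out_w=128, in_ch=3072, out_ch=768,
                       kernel_h=1, kernel_w=1)
    attn = attention_macs(n_queries=128 * 128, n_keys=256, dim=64)
    print(f"conv MACs: {conv:,}  attention MACs: {attn:,}")
```

Counts like these describe arithmetic work only. As Figure 1 notes, a layer that dominates FLOPs need not dominate GPU execution time.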