LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

Youbing Hu, Yun Cheng, Anqi Lu, Zhiqiang Cao, Dawei Wei, Jie Liu, Zhijun Li

TL;DR

LF-ViT tackles spatial redundancy in Vision Transformers by introducing a two-stage Localization and Focus framework that first processes a down-sampled image and, if needed, uses Neighborhood Global Class Attention to identify a class-discriminative region in the full-resolution image for focused processing. The approach reuses non-discriminative tokens and fuses discriminative-region features, with shared network parameters enabling end-to-end optimization. On ImageNet with a DeiT-S backbone, LF-ViT reduces FLOPs by up to 63% and doubles practical throughput while maintaining comparable accuracy, demonstrating a practical path to efficient high-resolution ViT inference. This work highlights the value of region-focused computation for accelerating transformer-based image recognition without sacrificing performance.

Abstract

The Vision Transformer (ViT) excels in accuracy when handling high-resolution images, yet it confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements. To address this, we present the Localization and Focus Vision Transformer (LF-ViT). This model operates by strategically curtailing computational demands without impinging on performance. In the Localization phase, a reduced-resolution image is processed; if a definitive prediction remains elusive, our pioneering Neighborhood Global Class Attention (NGCA) mechanism is triggered, effectively identifying and spotlighting class-discriminative regions based on initial findings. Subsequently, in the Focus phase, this designated region of the original image is used to enhance recognition. Uniquely, LF-ViT employs consistent parameters across both phases, ensuring seamless end-to-end optimization. Our empirical tests affirm LF-ViT's prowess: it decreases DeiT-S's FLOPs by 63% and concurrently doubles throughput. The code for this project is available at https://github.com/edgeai1/LF-ViT.git.
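
The two-phase control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `localize` and `focus` are hypothetical callables standing in for the two passes through the shared-parameter ViT, the confidence test (maximum class probability against a `threshold` of 0.75) is an assumed criterion, and `downsample_2x` is a crude stand-in for proper image resizing.

```python
def downsample_2x(image):
    """Keep every second pixel per dimension (a crude stand-in for real resizing)."""
    return [row[::2] for row in image[::2]]


def two_stage_predict(image, localize, focus, threshold=0.75):
    """Return (class_index, stage_used).

    Hypothetical interfaces, assumed for this sketch only:
      localize(small_image) -> (probs, state)
      focus(image, state)   -> probs
    where probs is a list of class probabilities and state carries whatever
    the focus pass wants to reuse from the localization pass.
    """
    # Localization phase: classify the reduced-resolution image first.
    probs, state = localize(downsample_2x(image))
    if max(probs) >= threshold:
        # Confident enough: exit early and skip the full-resolution work.
        return probs.index(max(probs)), "localization"

    # Focus phase: revisit the original image, guided by the first pass.
    probs = focus(image, state)
    return probs.index(max(probs)), "focus"
```

With this structure, easy images cost only the low-resolution pass, while the threshold sets how many inputs proceed to the more expensive focus phase.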

Paper Structure (16 sections, 13 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Examples of LF-ViT. FLOPs refer to the proportion of the computation required by LF-ViT (e.g., down-sampled to 112$\times$112) versus processing the entire 224$\times$224 input image.
  • Figure 2: Overview of LF-ViT: (1) Input images are down-sampled and embedded using a consistent patch embedding for both the down-sampled and original image. (2) The down-sampled image undergoes ViT processing for localization. (3) If localization lacks a confident prediction, the Neighborhood Global Class Attention (NGCA) mechanism pinpoints class-discriminative regions in the original image. (4) The top-K tokens with peak global class attention (GCA) from these regions are used for focused recognition. Feature fusion and token reuse mechanisms optimize computation in the focus stage.
  • Figure 3: Illustration of class-discriminative region identification, localization, and feature fusion in LF-ViT. The red numbers indicate the global class attention (GCA) of tokens. The green number marks the region with the maximum neighborhood global class attention (NGCA); this region is selected as the class-discriminative region.
  • Figure 4: Comparison between our LF-ViT and existing early-exiting methods. LF-ViT obtains good efficiency/accuracy tradeoffs compared with other ViTs. DVT [wang2021not], CF-ViT [cf_vit], and our LF-ViT are built upon DeiT.
  • Figure 5: Performance analysis of removing each of the four designs.
  • ...and 2 more figures
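
The region selection summarized in the captions for Figures 2 and 3 can be illustrated with a small sketch. It assumes, for illustration only, that the NGCA score of a region is the sum of per-token GCA values inside a square window on the down-sampled token grid, and that each coarse token maps back to a `patch * scale` pixel square in the original image. The function names and the `window`, `patch`, and `scale` parameters are hypothetical and not taken from the paper's code.

```python
def select_region(gca, window):
    """Find the window x window block of tokens with the largest summed GCA.

    gca is a 2D list of per-token global class attention scores on the
    down-sampled token grid. Returns (row, col, score) for the top-left
    token of the best block; ties keep the first block found.
    """
    rows, cols = len(gca), len(gca[0])
    best = None
    for r in range(rows - window + 1):
        for c in range(cols - window + 1):
            # Assumed neighborhood score: plain sum of GCA inside the window.
            score = sum(sum(gca[i][c:c + window]) for i in range(r, r + window))
            if best is None or score > best[2]:
                best = (r, c, score)
    return best


def region_to_pixels(row, col, window, patch=16, scale=2):
    """Map a token window on the down-sampled grid to a pixel box
    (top, left, bottom, right) in the original image."""
    size = patch * scale  # original-image pixels covered by one coarse token side
    return (row * size, col * size, (row + window) * size, (col + window) * size)
```

For a 112×112 down-sampled input with 16×16 patches, the token grid is 7×7, so a 3×3 window starting at token (3, 1) corresponds to the pixel box (96, 32, 192, 128) in the 224×224 original. A production version would more likely compute the window sums with a pooling or convolution operation than with explicit loops.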