FilterViT and DropoutViT

Bohang Sun

TL;DR

We introduce an enhanced version of ViT that performs attention-based QKV operations during the initial stages of downsampling, significantly reducing resource consumption while maintaining high performance.

Abstract

In this study, we introduce an enhanced version of ViT that conducts attention-based QKV operations during the initial stages of downsampling. Performing attention directly on high-resolution feature maps is computationally demanding due to the large size and numerous tokens. To mitigate this, we propose a filter attention mechanism that uses a Filter Block to create a salient mask (Filter Mask) for selecting the most informative pixels for attention. The Filter Block scores the pixels of the feature map, and we sort these scores to retain only the top K pixels (with K varying across layers). This approach effectively decreases the number of tokens involved in the attention computation, reducing computational complexity and boosting processing speed. Furthermore, the salient mask provides interpretability, as the model focuses on regions of the image most critical to the outcome. Our experimental results show that this model improves parameter efficiency and computational speed while enhancing accuracy. Compared to existing models, our approach significantly reduces resource consumption while maintaining high performance.
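The selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it uses plain Python lists instead of tensors, identity Q/K/V projections, and hand-written scores in place of a learned Filter Block. The names `filter_attention` and `top_k_indices` are hypothetical, and passing unselected pixels through unchanged is an assumption, since the abstract does not say how they are handled.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in positional order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        logits = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(logits)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out

def filter_attention(tokens, scores, k):
    """Run attention only among the k highest-scoring tokens."""
    keep = top_k_indices(scores, k)
    attended = self_attention([tokens[i] for i in keep])
    # Assumption: tokens that are filtered out pass through unchanged.
    result = [list(t) for t in tokens]
    for i, new in zip(keep, attended):
        result[i] = new
    return result, keep

# Four flattened "pixels" with two channels each. The scores stand in for
# the output of a Filter Block and are made up for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
scores = [0.9, 0.1, 0.3, 0.8]
output, kept = filter_attention(tokens, scores, k=2)
print(kept)
```

With N tokens in total and K retained, the number of pairwise attention scores drops from N² to K², which is where the savings on high-resolution feature maps come from.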

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: An overview of the Filter Attention mechanism. The input image is first processed by a convolutional neural network (CNN) to generate a feature map. A filter mask is then applied to this feature map, which selects the most salient pixels for further attention-based computation. Only the top-ranked pixels, based on the salient scores, are passed through the Transformer Encoder for global attention processing. This reduces the computational complexity by focusing attention on the most relevant regions of the image.
  • Figure 2: Illustration of the Global Self-Attention mechanism with pooling. After the initial feature extraction by the CNN, average pooling is applied to reduce the size of the feature map. This pooling operation significantly decreases the number of tokens passed into the Transformer Encoder for self-attention computation. The remaining tokens retain global context and are processed by the self-attention module, capturing long-range dependencies in the image with reduced computational cost.
  • Figure 3: Validation accuracy on the first ImageNet-100 subset.
  • Figure 4: Scatter plot comparing different models in terms of parameter count, accuracy (on the first ImageNet-100 subset), and inference speed. The x-axis represents the number of parameters (in millions), and the y-axis represents classification accuracy. Each point's color indicates the inference speed on the CPU (FPS), with a gradient from purple to yellow representing increasing speed. The diameter of each point corresponds to the inference speed on CUDA (FPS), as indicated by the horizontal legend at the top (showing example values of 100, 500, 1000, and 1500 FPS).
  • Figure 5: Visualization of filter masks across three Filter Attention layers. The first layer focuses on edges and general features, the second on the main object, and the third on the background. This layered attention mechanism suggests that FilterMobileViT has inherent explainability.
  • ...and 2 more figures