Maximizing the Position Embedding for Vision Transformers with Global Average Pooling

Wonjun Lee, Bumsub Ham, Suhyun Kim

TL;DR

This work tackles the limited expressiveness of position embeddings in vision transformers when using GAP and Layer-wise structures. It reveals a counterbalancing role for PE across layers and proposes MPVG, which feeds PE into the Last LN to maximize its effectiveness while preserving layer-wise dynamics. Empirical results across image classification, object detection, and semantic segmentation show MPVG consistently outperforms prior methods including PVG, with notable gains on ImageNet-1K (e.g., DeiT-Ti from 72.14% to 73.51%) and CIFAR-100 (ViT-Lite from 74.90% to 76.87%), as well as downstream tasks (COCO AP, ADE20K mIoU). The findings imply that maintaining PE-driven counterbalancing directionality improves ViT performance under GAP, offering a practical, broadly applicable adjustment to PE design in vision transformers.

Abstract

In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, in vision transformer structures, there is a limitation in the expressiveness of PE due to the structure where position embedding is simply added to the token embedding. A layer-wise method that delivers PE to each layer and applies independent Layer Normalizations for token embedding and PE has been adopted to overcome this limitation. In this paper, we identify the conflicting result that occurs in a layer-wise structure when using the global average pooling (GAP) method instead of the class token. To overcome this problem, we propose MPVG, which maximizes the effectiveness of PE in a layer-wise structure with GAP. Specifically, we identify that PE counterbalances token embedding values at each layer in a layer-wise structure. Furthermore, we recognize that the counterbalancing role of PE is insufficient in the layer-wise structure, and we address this by maximizing the effectiveness of PE through MPVG. Through experiments, we demonstrate that PE performs a counterbalancing role and that maintaining this counterbalancing directionality significantly impacts vision transformers. As a result, the experimental results show that MPVG outperforms existing methods across vision transformers on various tasks.
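The final stage described above (independent Layer Normalizations for token embedding and PE, PE delivered to the Last LN, then GAP instead of a class token) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `layer_norm` and `gap_head_with_pe` are hypothetical, the learned scale and shift of LN are omitted, and combining the two normalized streams by simple addition is an assumption.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance.

    The learned scale/shift parameters of a real LN are omitted here.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def gap_head_with_pe(tokens, pos_embed):
    """Hypothetical final stage: separate LNs for tokens and PE, sum, then GAP.

    Assumption: the two normalized streams are combined by addition.
    """
    combined = [
        [t + p for t, p in zip(layer_norm(tok), layer_norm(pe))]
        for tok, pe in zip(tokens, pos_embed)
    ]
    num_tokens = len(combined)
    dim = len(combined[0])
    # Global average pooling: average each feature dimension over all tokens.
    return [sum(row[d] for row in combined) / num_tokens for d in range(dim)]


# Toy input: 3 tokens with 4 features each (illustrative values only).
tokens = [[1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 1.0, 5.0], [0.5, 0.5, 2.5, 0.5]]
pos_embed = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.4, -0.1, 0.2], [-0.3, 0.1, 0.2, 0.5]]
pooled = gap_head_with_pe(tokens, pos_embed)
```

The pooled vector has one entry per feature dimension and would be passed to the classifier in place of a class token.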

Paper Structure

This paper contains 27 sections, 11 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: The conflicting result between the GAP method and the Layer-wise method. In DeiT-Ti, using the GAP method and the Layer-wise method separately improves performance, but combining the two decreases it. MPVG resolves this conflict between the GAP and Layer-wise structures, maximizing the effect of PE.
  • Figure 2: The heatmaps depict the characteristics of each layer in both the original structure and the Layer-wise structure with the GAP method. For the Layer-wise structure, the heatmaps illustrate cases both with and without PE in the Last LN. For each heatmap based on DeiT-Ti, the x-axis represents the embedding dimension of DeiT-Ti (192), and the y-axis represents the number of tokens (196). In both (a) and the top row (token embedding) of (b), the heatmaps represent the average value of token embedding in each layer, while the bottom row of (b) shows the heatmap of PE. The correlation in (b) refers to the correlation coefficient between token embedding and position embedding.
  • Figure 3: The overview of the various methods. (a) ViT. (b) LaPE. (c) PVG, an improved Layer-wise structure. Specifically, we adopt a structure where the token embedding and PE are added before entering layer 0 and a hierarchical structure for delivering PE, excluding layer 0. (d) MPVG. The main difference from PVG is whether the initial PE is delivered to the Last LN.
  • Figure 4: Correlation coefficient between token embedding and position embedding in Layer-wise. Each token embedding and position embedding is based on the values after applying LN. DeiT-Ti, DeiT-S, and CeiT-Ti each have a total of 12 layers, but T2T-ViT-7 has 7 layers.
  • Figure 5: Comparison of two methods on DeiT-Ti. (a) Structure with only GAP applied, showing 72.40% performance; and (b) Structure with GAP and position embedding added to the Last LN in a non-Layer-wise structure, also showing 72.40% performance.
  • ...and 3 more figures
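
The correlation coefficient between token embedding and position embedding mentioned in Figures 2 and 4 can be illustrated with a plain Pearson correlation. This is only a sketch under stated assumptions: the helper names `pearson` and `flatten` are hypothetical, the toy values are made up, and flattening each post-LN matrix over tokens and dimensions before correlating is an assumption about how the per-layer value is computed. A negative coefficient would be consistent with PE counterbalancing the token embedding.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def flatten(matrix):
    """Flatten a tokens-by-dimension matrix into one list of values."""
    return [value for row in matrix for value in row]


# Toy post-LN values for one layer: 2 tokens with 3 features each.
token_embedding = [[0.9, -1.1, 0.2], [-0.4, 0.8, -0.4]]
position_embedding = [[-0.8, 1.0, -0.2], [0.5, -0.9, 0.4]]

# PE mostly has the opposite sign of the token embedding here,
# so the coefficient comes out strongly negative.
r = pearson(flatten(token_embedding), flatten(position_embedding))
```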