MABViT -- Modified Attention Block Enhances Vision Transformers

Mahesh Ramesh, Aswinkumar Ramkumar

TL;DR

Running the attention and MLP sub-blocks in parallel within each Transformer block speeds up LLM training, but in Vision Transformers at common vision scales it can underperform the standard serial architecture due to representational collapse. The paper introduces MABViT, which injects non-linearity into the attention path by applying a GLU-based activation to the Value tensor, and also evaluates a GELU variant. On ImageNet-1K, the GLU-based MABViT variants reach higher accuracy with fewer parameters than the ViT baselines: the S/16 variant gains 0.6% over the standard ViT S/16, and the approach also outperforms B/16 while using only half the parameters. The work further finds that MABViT variants scale better in deeper transformers, with less training instability and faster convergence. Overall, the approach offers a practical route to more parameter-efficient, deeper Vision Transformers.

Abstract

Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly in Large Language Models (LLMs). Additionally, utilizing a parallel configuration within each Transformer block rather than the conventional serialized method has been shown to accelerate the training of LLMs without significantly impacting performance. However, when the MLP and attention block were run in parallel for the image classification task, we observed a noticeable decline in performance. We propose a novel transformer variant that integrates non-linearity within the attention block to tackle this problem. We implemented the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while utilizing fewer parameters. It also outperforms the B/16 variant while using only half the parameters. Furthermore, we provide results with the GELU activation function variant to confirm our assertions. Lastly, we show that the MABViT variants exhibit greater potential when utilized in deep transformers compared to the standard architecture.
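
To make the central idea concrete, the sketch below contrasts standard scaled dot-product attention with a version in which the Value tensor passes through a GLU-style gate before being mixed by the attention weights. It is a minimal single-head NumPy illustration, not the authors' implementation: the sigmoid gate, the separate `w_gate` projection, and all function and variable names are assumptions made for this example, and the paper's exact parameterization (including its parameter-reduced variants) is not specified in this summary.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention(x, w_q, w_k, w_v, w_gate=None):
    """Single-head scaled dot-product attention over tokens x of shape (n, d).

    If w_gate is given, the Value tensor is gated element-wise before the
    attention weights are applied. The gate form (sigmoid of a separate
    linear projection) is an assumption for illustration only.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    if w_gate is not None:
        # GLU-style gating: a linear path multiplied by a sigmoid gate.
        v = v * sigmoid(x @ w_gate)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v, w_gate = (
    rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(4)
)

out_standard = attention(x, w_q, w_k, w_v)       # linear Value path
out_gated = attention(x, w_q, w_k, w_v, w_gate)  # non-linear Value path
print(out_standard.shape, out_gated.shape)
```

In the standard version the output is a linear function of the Values once the attention weights are fixed; the gate makes the Value path itself non-linear in the input. Note that this sketch adds a projection matrix, whereas the paper reports fewer parameters than the baselines, so its actual layout must differ from the one assumed here.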

Paper Structure

This paper contains 29 sections, 14 equations, and 5 figures.

Figures (5)

  • Figure 1: Scaled Dot Product Attention
  • Figure 2: Modified Scaled Dot Product Attention
  • Figure 3: Validation accuracy progression of the Baseline S/16 18 Layers variant over 90,000 training steps.
  • Figure 4: Validation accuracy trajectory of the MABViT PR-GLU S/16 18 Layers variant over 90,000 training steps.
  • Figure 5: Difference in validation accuracy between MABViT PR-GLU-Base S/16 18L and Base S/16 18L over 90,000 training steps.