MABViT -- Modified Attention Block Enhances Vision Transformers

Mahesh Ramesh; Aswinkumar Ramkumar

TL;DR

The paper addresses a weakness of parallel within-block processing in Vision Transformers: at common vision scales, running the attention and MLP blocks in parallel can underperform the standard sequential architecture because of representational collapse. It introduces MABViT, which injects non-linearity into the attention path by applying a GLU-based activation to the Value tensor, and it also evaluates a GELU variant. On ImageNet-1K, GLU-based MABViT variants achieve higher accuracy with fewer parameters than the ViT baselines, including a 0.6% gain for S/16 over the standard ViT, and parameter-reduced GLU versions maintain or improve efficiency. The work also finds that MABViT variants scale better in deeper transformers, with less training instability and faster convergence. Overall, the approach offers a practical route to deeper, more parameter-efficient Vision Transformers.
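
To make the distinction between sequential and parallel within-block processing concrete, here is a minimal NumPy sketch of the two Pre-LN update rules as they are commonly written. The `attn` and `mlp` callables are hypothetical toy stand-ins (a linear map and a ReLU layer), the LayerNorm has no learned scale or shift, and the parallel form assumes one shared normalization. None of these details are taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector; learned scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def serial_block(x, attn, mlp):
    # Standard Pre-LN: the MLP sees the output of the attention sub-block.
    x = x + attn(layer_norm(x))
    return x + mlp(layer_norm(x))

def parallel_block(x, attn, mlp):
    # Parallel Pre-LN: both sub-blocks read the same normalized input.
    h = layer_norm(x)
    return x + attn(h) + mlp(h)

rng = np.random.default_rng(0)
tokens, d_model = 4, 8
x = rng.normal(size=(tokens, d_model))
w_a = rng.normal(size=(d_model, d_model)) * 0.1  # hypothetical "attention" weights
w_m = rng.normal(size=(d_model, d_model)) * 0.1  # hypothetical "MLP" weights

attn = lambda h: h @ w_a                  # toy stand-in, not real attention
mlp = lambda h: np.maximum(h @ w_m, 0.0)  # toy stand-in with a ReLU

y_serial = serial_block(x, attn, mlp)
y_parallel = parallel_block(x, attn, mlp)
```

In the parallel form the MLP no longer receives the attention output within the same block, which is the structural difference the summary refers to.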

Abstract

Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly Large Language Models (LLMs). Additionally, using a parallel configuration within each Transformer block, rather than the conventional serialized one, has been shown to accelerate the training of LLMs without significantly impacting performance. However, when the MLP and attention blocks were run in parallel for the image classification task, we observed a noticeable decline in performance. To tackle this problem, we propose a novel transformer variant that integrates non-linearity within the attention block. We apply a GLU-based activation function to the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while using fewer parameters. It also surpasses the B/16 variant while using only half the parameters. Furthermore, we provide results with a GELU activation function variant to confirm our assertions. Lastly, we show that the MABViT variants exhibit greater potential than the standard architecture when used in deep transformers.
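
As an illustration of what applying a GLU-based activation to the Value tensor could look like, the sketch below implements single-head scaled dot-product attention in NumPy with a gated Value. It assumes the common GLU form `(x W_v) * sigmoid(x W_gate)`. The function name `glu_value_attention`, the separate `w_gate` projection, and the shapes are illustrative assumptions, not the paper's exact formulation (in particular, this is not the parameter-reduced variant).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu_value_attention(x, w_q, w_k, w_v, w_gate):
    """Single-head attention with a GLU on the Value tensor (assumed form).

    x: (tokens, d_model); each weight matrix: (d_model, d_head).
    """
    q = x @ w_q
    k = x @ w_k
    # Assumption: GLU(x) = (x W_v) * sigmoid(x W_gate), applied to Value only.
    v = (x @ w_v) * sigmoid(x @ w_gate)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v

rng = np.random.default_rng(0)
tokens, d_model, d_head = 4, 8, 8
x = rng.normal(size=(tokens, d_model))
w_q, w_k, w_v, w_gate = (
    rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(4)
)
out = glu_value_attention(x, w_q, w_k, w_v, w_gate)
```

Without the gate, the attention output is a weighted sum of linear projections of the input. The gate makes the Value a non-linear function of each token before the weighted sum is taken.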

Paper Structure (29 sections, 14 equations, 5 figures)

Introduction
Background
Transformers
Pre-LayerNormalization Transformer
Post-LayerNormalization Transformer
Vision Transformers
Parallel Structure
Representational Collapse
Gated Linear Units
Related Work
Methodology
Standard Pre-LN Computation
Parallel Pre-LN Computation
Standard Attention Block
Modified Attention Block
...and 14 more sections

Figures (5)

Figure 1: Scaled Dot Product Attention
Figure 2: Modified Scaled Dot Product Attention
Figure 3: Validation accuracy progression of the Baseline S/16 18 Layers variant over 90,000 training steps.
Figure 4: Validation accuracy trajectory of the MABViT PR-GLU S/16 18 Layers variant over 90,000 training steps.
Figure 5: Difference between validation accuracy of MABViT PR-GLU-Base S/16 18L vs Base S/16 18L over 90,000 training steps
