Activator: GLU Activation Function as the Core Component of a Vision Transformer

Abdullah Nazhat Abdullah, Tarkan Aydin

TL;DR

The paper introduces Activator, a GLU-based MLP block intended to replace both the attention mechanism and the standard MLP in vision transformers in order to reduce compute. Activator relies on local token interactions through GLU MLPs and serves as a compact, single-block core. It is evaluated under an equalized setting against ViT and MLP-Mixer on CIFAR-10/100, where it reports competitive or superior accuracy (73.2%/46.1% for Activator versus 65.7%/34.9% and 70.1%/39.2% for the two baselines). Ablation studies across the GEGLU, SwiGLU, and ReGLU variants indicate that performance is robust to the choice of activation. The results suggest that GLU-based MLPs can fulfill the roles of both key-value mapping and information gating commonly realized by attention, opening a path to lighter transformer designs for CV tasks and beyond.

Abstract

The transformer architecture has driven many successes across a variety of deep learning tasks, in particular the recent advances in natural language processing (NLP) culminating in large language models (LLMs). Adding to that success, the transformer architecture has attracted widespread interest from computer vision (CV) researchers and practitioners, enabling many advances in vision-related tasks and opening the door to multitask and multi-modal deep learning architectures that share the same principle of operation. One drawback of these architectures is their reliance on the scaled dot-product attention mechanism with the softmax activation function, which is computationally expensive and requires large compute capabilities for both training and inference. This paper investigates replacing the MLP and attention mechanism usually adopted in the transformer architecture with a structure based on the gated linear unit (GLU) activation function, with the aim of reducing computational cost. The equalized experimental assessments conducted in this work show that the proposed modification, with its targeted reductions in computational complexity, offers competitive performance compared to the selected baseline architectures. The results strongly support the aims of this work, whose focus was to make extensive use of GLU-based MLPs, establishing a more efficient but capable alternative to the traditional MLP and the attention mechanism as the core component in the design of transformer architectures.
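
As a rough illustration of what a GLU-based MLP computes, the sketch below follows the commonly used GLU-variant formulation, in which an activated "gate" projection is multiplied elementwise with a linear "value" projection before an output projection. It is not the authors' Activator implementation: the function names, the bias-free projections, the weight shapes, and the toy dimensions are all assumptions made for illustration. The three variants differ only in the activation applied to the gate (ReLU for ReGLU, GELU for GEGLU, Swish for SwiGLU).

```python
import math
import random


def relu(z):
    return max(0.0, z)


def gelu(z):
    # Exact GELU written with the Gaussian error function.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z):
    # Swish / SiLU: z * sigmoid(z). Fine for moderate |z|; very negative
    # inputs would overflow math.exp in this simple form.
    return z / (1.0 + math.exp(-z))


# Hypothetical lookup table mapping a variant name to its gate activation.
ACTIVATIONS = {"reglu": relu, "geglu": gelu, "swiglu": swish}


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def glu_mlp(x, W_gate, W_value, W_out, variant="geglu"):
    # out = W_out @ (act(W_gate @ x) * (W_value @ x)), biases omitted.
    act = ACTIVATIONS[variant]
    gate = [act(g) for g in matvec(W_gate, x)]
    value = matvec(W_value, x)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(W_out, hidden)


def random_matrix(rows, cols, rng):
    # Small Gaussian initialisation, chosen only for this demo.
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumed): one token of width 4, hidden width 8.
rng = random.Random(0)
d_model, d_hidden = 4, 8
W_gate = random_matrix(d_hidden, d_model, rng)
W_value = random_matrix(d_hidden, d_model, rng)
W_out = random_matrix(d_model, d_hidden, rng)
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

for name in ACTIVATIONS:
    print(name, glu_mlp(token, W_gate, W_value, W_out, name))
```

Because the gate multiplies the value projection elementwise, each hidden unit can be scaled down or passed through depending on the input, which is the information-gating role the summary attributes to GLU-based MLPs.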

Paper Structure

This paper contains 8 sections, 8 equations, 4 figures, 2 tables, 3 algorithms.

Figures (4)

  • Figure 1: An illustration of the Activator mechanism.
  • Figure 2: A diagrammatic comparison of the Activator architectures with the ViT architecture.
  • Figure 3: An illustration of the accuracy curves for the Activator architecture.
  • Figure 4: An illustration of the loss curves for the Activator architecture.