Shrinking the Giant : Quasi-Weightless Transformers for Low Energy Inference

Shashank Nag, Alan T. L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, Aishwarya Sivakumar, Eugene B. John, Krishnan Kailas, Priscila M. V. Lima, Neeraja J. Yadwadkar, Felipe M. G. Franca, Lizy K. John

TL;DR

This work extends learned Look Up Table (LUT) networks to perform the functions of the Multi Layer Perceptron (MLP) layers in transformer models and integrates them with transformers to propose Quasi Weightless Transformers (QuWeiT), a computationally efficient and energy-efficient inference solution for transformer-based models.

Abstract

Transformers are set to become ubiquitous, with applications ranging from chatbots and educational assistants to visual recognition and remote sensing. However, their increasing computational and memory demands are resulting in growing energy consumption. Building models with fast and energy-efficient inference is imperative to enable a variety of transformer-based applications. Look Up Table (LUT) based Weightless Neural Networks are faster than conventional neural networks, as their inference involves only a few lookup operations. Recently, an approach for learning LUT networks directly via an Extended Finite Difference method was proposed. We build on this idea, extending it to perform the functions of the Multi Layer Perceptron (MLP) layers in transformer models, and integrate the resulting LUT networks with transformers to propose Quasi Weightless Transformers (QuWeiT). This allows for a computationally efficient and energy-efficient inference solution for transformer-based models. On I-ViT-T, we achieve a comparable accuracy of 95.64% on the CIFAR-10 dataset while replacing approximately 55% of all the multiplications in the entire model and achieving a 2.2x improvement in energy efficiency. We also observe similar savings in experiments with the nanoGPT framework.
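To make the "few lookup operations" claim concrete, the following is a minimal sketch of inference in a LUT-based layer. It is not the paper's implementation: the function names (`lut_neuron`, `lut_layer`), the layer sizes, the random input wiring, and the random table contents are all hypothetical, chosen only to illustrate that each neuron's output is a single table read with no multiplications. Training the tables (for example via the Extended Finite Difference method mentioned above) is not shown.

```python
import random

def lut_neuron(table, bits):
    """Concatenate binary inputs into an address and read one table entry."""
    address = 0
    for b in bits:
        address = (address << 1) | (b & 1)  # first bit becomes the most significant
    return table[address]

def lut_layer(tables, connections, x_bits):
    """Each neuron reads its own subset of the input bits; no multiply-accumulate."""
    return [
        lut_neuron(table, [x_bits[i] for i in conn])
        for table, conn in zip(tables, connections)
    ]

# Hypothetical sizes: 16 binary inputs, 4 neurons, 3 inputs per neuron.
rng = random.Random(0)
n_inputs, n_neurons, fan_in = 16, 4, 3

# Assumption: each neuron is wired to a random subset of the inputs.
connections = [rng.sample(range(n_inputs), fan_in) for _ in range(n_neurons)]
# Assumption: table contents are random here; in practice they would be learned.
tables = [[rng.randint(0, 1) for _ in range(2 ** fan_in)] for _ in range(n_neurons)]

x_bits = [rng.randint(0, 1) for _ in range(n_inputs)]
out = lut_layer(tables, connections, x_bits)
print(out)
```

A neuron with `fan_in` inputs needs a table of `2 ** fan_in` entries, so keeping the fan-in small is what keeps the tables compact in a design like this.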

Paper Structure

This paper contains 22 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Distribution of model parameters and MAC operations between the MLP and other layers in common Transformer-based models. The MLP layers contribute over 60% of the overall model weights and about 50-70% of the overall MAC operations.
  • Figure 2: Proposed Quasi-Weightless Transformer design: overview of a single encoder/decoder layer in the model.
  • Figure 3: A typical transformer model architecture with an encoder-only or decoder-only stack of layers. In the case of decoder-only models, the MSA shown here is a masked self-attention, with future tokens masked during attention score computation.
  • Figure 4: (a) Conventional Neuron: Each neuron multiplies inputs with weights and adds them. (b) Binary Neural Network Neuron: Because the weights are binary, the multiplication operation is replaced by an XNOR. (c) Weightless Neuron: In contrast, the input sequence is concatenated and used to "look up" in the LUT with no MAC operations involved.
  • Figure 5: Differentiable Weightless Block that replaces the MLP
  • ...and 3 more figures
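The MAC breakdown summarized in Figure 1 can be approximated with a back-of-the-envelope count for one transformer layer. The sketch below is an estimate under stated assumptions, not the paper's accounting: the function `layer_macs` is hypothetical, the MLP is assumed to use the common 4x hidden expansion, and the example sizes (197 tokens, model width 192) are assumed values typical of a tiny vision transformer. Layer norms, softmax, embeddings, and the classifier head are ignored.

```python
def layer_macs(seq_len, d_model, mlp_ratio=4):
    """Rough multiply-accumulate count for one transformer layer."""
    # Q, K, V and output projections: four d_model x d_model matrices per token.
    proj = 4 * seq_len * d_model * d_model
    # Attention scores (QK^T) and the attention-weighted sum over V.
    attn = 2 * seq_len * seq_len * d_model
    # Two MLP matrices: d_model -> mlp_ratio*d_model -> d_model.
    mlp = 2 * seq_len * d_model * (mlp_ratio * d_model)
    return {"attention": proj + attn, "mlp": mlp}

# Assumed sizes, for illustration only.
macs = layer_macs(seq_len=197, d_model=192)
mlp_share = macs["mlp"] / (macs["attention"] + macs["mlp"])
print(f"MLP share of layer MACs: {mlp_share:.1%}")
```

With these assumed sizes the MLP accounts for a little over half of the layer's MACs, which is consistent with the 50-70% range quoted in the Figure 1 caption. The share shrinks as the sequence length grows, because the attention-score term scales quadratically with the number of tokens.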