LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons

Shashank Nag, Alan T. L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, Aishwarya Sivakumar, Eugene B. John, Krishnan Kailas, Priscila M. V. Lima, Neeraja J. Yadwadkar, Felipe M. G. Franca, Lizy K. John

TL;DR

This work addresses the challenge of deploying vision transformers on edge devices by reducing the memory and compute demands of the MLP-based channel mixer, which dominates both. It introduces LL-ViT, which replaces the channel mixer with a learnable LUT-based block trained end-to-end, and co-designs an FPGA accelerator that keeps weights on-chip and eliminates off-chip memory traffic. LL-ViT achieves accuracy comparable to a baseline INT8 ViT on CIFAR-10/100 and Tiny-ImageNet while reducing model size by over 60% and delivering up to 1.9× higher energy efficiency and 1.3× lower latency, with higher throughput under a 10.9 W power budget. The combination of learned LUTs, thermometer encoding, and a differentiable conditional summation layer enables multiplication-free, hardware-friendly channel mixing, offering a practical path to tiny transformers on edge devices.

Abstract

Vision Transformers have been tremendously successful in computer vision tasks. However, their large computational, memory, and energy demands are a challenge for edge inference on FPGAs, a field that has seen a recent surge in demand. We recognize the benefits of recent works on logic and Look Up Table (LUT) based networks, such as LogicNets, NeuraLUT, and DWN, among others, in offering models that reduce both the memory and compute footprints simultaneously. However, these models do not natively perform well on common vision tasks such as CIFAR-10/100. In this work, we propose LL-ViT, a novel edge-optimized vision transformer design that integrates layers of LUT neurons within the transformer architecture. Based on our characterization, which shows that a majority of model weights and computations come from the channel mixer (MLP layer), we design an alternate LUT-based channel mixer and simultaneously develop an FPGA-based accelerator for LL-ViT. In contrast to attempts that replace each multiplication with a table lookup, our architecture uses a neural learning approach that natively learns the LUT functions. This approach allows for reduced model sizes and a compute- and energy-efficient inference solution for vision transformer models. Evaluating on edge-suitable workloads, we achieve accuracies of 95.5% on CIFAR-10, 78.8% on CIFAR-100, and 60.9% on Tiny-ImageNet, comparable to the baseline transformer. LL-ViT eliminates over 60% of the model weights and 50% of the multiplications in the model, and achieves 1.9× higher energy efficiency and 1.3× lower latency than an integer-quantized ViT accelerator, while also offering superior throughput against prior works at a 10.9 W power budget.
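The characterization mentioned above (most weights and MACs sit in the channel mixer) can be checked with simple arithmetic. The sketch below is not the authors' profiling code; it assumes the commonly cited DeiT-T configuration (embedding width 192, 12 encoder layers, MLP ratio 4, 16×16 patches on a 224×224 RGB image giving 197 tokens, 1000 classes), none of which is stated in this summary, and the function name `deit_tiny_breakdown` is hypothetical.

```python
def deit_tiny_breakdown(dim=192, depth=12, mlp_ratio=4, tokens=197,
                        patch=16, in_ch=3, classes=1000):
    """Estimate the MLP share of parameters and MACs in a DeiT-T-like ViT.

    All default values are assumptions about the DeiT-T configuration,
    not numbers taken from the summary above.
    """
    hidden = dim * mlp_ratio

    # Parameters per encoder layer (weights plus biases).
    attn_params = 4 * dim * dim + 4 * dim          # Q, K, V and output projections
    mlp_params = 2 * dim * hidden + hidden + dim   # two linear layers
    norm_params = 2 * 2 * dim                      # two LayerNorms (scale + shift)

    # Parameters outside the encoder stack.
    patch_params = in_ch * patch * patch * dim + dim
    pos_and_cls = tokens * dim + dim               # position embeddings + class token
    final_norm = 2 * dim
    head_params = dim * classes + classes
    total_params = (depth * (attn_params + mlp_params + norm_params)
                    + patch_params + pos_and_cls + final_norm + head_params)

    # Multiply-accumulates per image.
    attn_macs = tokens * 4 * dim * dim + 2 * tokens * tokens * dim  # projections + QK^T + AV
    mlp_macs = tokens * 2 * dim * hidden
    patch_macs = (tokens - 1) * in_ch * patch * patch * dim         # class token is not a patch
    head_macs = dim * classes
    total_macs = depth * (attn_macs + mlp_macs) + patch_macs + head_macs

    return {
        "total_params": total_params,
        "mlp_param_share": depth * mlp_params / total_params,
        "total_macs": total_macs,
        "mlp_mac_share": depth * mlp_macs / total_macs,
    }


if __name__ == "__main__":
    stats = deit_tiny_breakdown()
    print(f"MLP share of parameters: {stats['mlp_param_share']:.1%}")
    print(f"MLP share of MACs:       {stats['mlp_mac_share']:.1%}")
```

Under these assumptions the MLP layers account for a little over 60% of the parameters and a little over 55% of the MACs, in line with the figures quoted for DeiT-T, which is why the channel mixer is the natural target for replacement.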

Paper Structure

This paper contains 15 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: The MLP layers contribute over 60% of the overall model weights and 55% of the overall multiply-accumulate (MAC) operations in the DeiT-T vision transformer [deit].
  • Figure 2: Proposed Learned-LUT based Vision Transformer (LL-ViT) - overview of a single encoder layer in the model
  • Figure 3: A typical vision transformer model consisting of a stack of encoder blocks. Figure adapted from [vit].
  • Figure 4: LUT or RAM-node Neuron: The input sequence is concatenated and used to "look up" the output in the LUT, with no MAC operations involved.
  • Figure 5: Proposed Learned LUT-based Vision Transformer Design -- (a) Overall Design, (b) a LUT-based Channel Mixer within the encoder block, (c) the conditional summation layer implementation for a particular channel. This is repeated in each encoder.
  • ...and 4 more figures
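
The LUT neuron described in the Figure 4 caption, together with the thermometer encoding mentioned in the TL;DR, can be illustrated with a minimal sketch. This is not the LL-ViT implementation: the evenly spaced thresholds, the hand-written XOR table, and the names `thermometer_encode` and `lut_neuron` are assumptions for illustration. In the paper the table contents are learned during training, and the conditional summation layer is not shown here.

```python
def thermometer_encode(x, num_bits, lo=0.0, hi=1.0):
    """Unary (thermometer) code for a scalar in [lo, hi].

    Bit i is 1 when x exceeds the i-th threshold. Evenly spaced
    thresholds are an assumption made for this sketch.
    """
    step = (hi - lo) / (num_bits + 1)
    return [1 if x > lo + (i + 1) * step else 0 for i in range(num_bits)]


def lut_neuron(bits, input_idx, table):
    """Look up one output from a table addressed by selected input bits.

    The selected bits are concatenated into an address (first index is the
    most significant bit); no multiply-accumulate is involved.
    """
    if len(table) != 1 << len(input_idx):
        raise ValueError("table must have 2**len(input_idx) entries")
    addr = 0
    for i in input_idx:
        addr = (addr << 1) | bits[i]
    return table[addr]


if __name__ == "__main__":
    bits = thermometer_encode(0.7, num_bits=4)   # thresholds at 0.2, 0.4, 0.6, 0.8
    xor_table = [0, 1, 1, 0]                     # hypothetical 2-input LUT contents
    print(bits, lut_neuron(bits, input_idx=(2, 3), table=xor_table))
```

A neuron with n inputs needs a table of 2^n entries, so keeping n small (for example, matching the native LUT width of the target FPGA) is what keeps this style of layer compact.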