Bitformer: An efficient Transformer with bitwise operation-based attention for Big Data Analytics at low-cost low-precision devices

Gaoxiang Duan; Junkai Zhang; Xiaoying Zheng; Yongxin Zhu; Victor Chang


TL;DR

The Bitformer model aims to reconcile the computational demands of modern Transformer models with the constraints of edge computing scenarios, bridging the gap between high-performing models and resource-scarce environments.

Abstract

In the current landscape of large models, the Transformer is a cornerstone architecture. However, its application is hampered by the substantial computational cost of its attention mechanism. Its reliance on high-precision floating-point operations poses a further hurdle, particularly in edge computing environments, which are characterized by resource-constrained devices and a preference for lower precision. To meet the data processing demands of edge devices, we introduce the Bitformer model, an extension of the Transformer paradigm. At its core is a novel attention mechanism that replaces conventional floating-point matrix multiplication with bitwise operations. This substitution has two advantages: it preserves the attention mechanism's ability to capture long-range dependencies, and it reduces the computational complexity of the attention operation from $O(n^2d)$ for floating-point operations to $O(n^2T)$ for bitwise operations, where the parameter $T$ is markedly smaller than the conventional dimensionality parameter $d$. In essence, the Bitformer model reconciles the demands of modern computing with the constraints of edge computing scenarios, bridging the gap between high-performing models and resource-scarce environments and opening a promising direction for further work.
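As a rough illustration of how a bitwise operation can stand in for a dot product, the sketch below computes the Hamming distance between two binary vectors with XOR and a bit count, then checks it against the dot product of the same vectors mapped to $\{-1, +1\}$. It is not taken from the paper: the function name `hamming_xor`, the use of NumPy, and the vector length `d = 64` are assumptions chosen for the example.

```python
import numpy as np

def hamming_xor(a_bits: np.ndarray, b_bits: np.ndarray) -> int:
    """Hamming distance between two 0/1 vectors via XOR and a bit count."""
    # Pack 8 bits per byte so the XOR acts on whole bytes at once.
    a_packed = np.packbits(a_bits.astype(np.uint8))
    b_packed = np.packbits(b_bits.astype(np.uint8))
    xor = np.bitwise_xor(a_packed, b_packed)
    # Count the set bits of the XOR result (padding bits are zero in both inputs).
    return int(np.unpackbits(xor).sum())

rng = np.random.default_rng(0)
d = 64  # illustrative vector length, not a value from the paper
q = rng.integers(0, 2, size=d)
k = rng.integers(0, 2, size=d)

dist = hamming_xor(q, k)
# Map bits {0, 1} to {-1, +1}: the dot product then equals d - 2 * Hamming distance.
dot_bipolar = int(np.dot(2 * q - 1, 2 * k - 1))
assert dot_bipolar == d - 2 * dist
```

The identity checked at the end holds for any pair of binary vectors of equal length, which is why a Hamming distance can carry the same ranking information as a dot product once the data is binary.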


Paper Structure (27 sections, 17 equations, 7 figures, 6 tables, 2 algorithms)


Introduction
Related works
Transformer
Efficient Attention Mechanism for Edge Devices
Spike Neuron Network
Prior knowledge
Attention Mechanism
Spike Neurons
Hamming Distance
Method
Bitformer Implementation
Bitformer vs. Quantification
Data Format Conversion
Bitwise Attention
Complexity
...and 12 more sections

Figures (7)

Figure 1: The Bitformer attention mechanism, illustrated with an image input. It consists of three steps, aligning with the method description, and incorporates two key ideas. In Step 1, Time Integrate-and-Fire (TIF) converts the float data $Q_f$ and $K_f$ into binary data $Q_b$ and $K_b$, which turns the attention operation into a binary operation. In Step 2, the Hamming distance, computed with the XOR operation, replaces the dot product as the measure of similarity between tokens. Concatenating the distance scores for all time steps completes the attention operation and returns the result to the real number field. The approach can be applied to various domains; an image of size $W\times H\times d$ is used here as the example.
Figure 2: Comparing Hamming distance to dot product for binary data. Standard float operations carry the richest information but are also the most expensive to compute. SNN-based methods use spike operations such as IF to convert data into binary space, which is power-friendly but loses much of the information. Our method converts each float value into a series of binary values, which reduces compute cost while minimizing information loss.
Figure 3: Comparing the 8-bit Hamming distance with the 8-bit dot product at the circuit level.
Figure 4: Power and accuracy on ImageNet. Our method achieves a good trade-off between power and accuracy.
Figure 5: Performance of standard attention and our bitwise attention on an FPGA.
...and 2 more figures
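
The two steps described in the Figure 1 caption can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the TIF encoder is not specified on this page, so `if_encode` is a hypothetical plain integrate-and-fire encoder with a soft reset, a threshold of 1.0, and inputs assumed to lie in $[0, 1]$. The function `per_step_hamming` and the final averaging over time steps are likewise illustrative choices.

```python
import numpy as np

def if_encode(x: np.ndarray, T: int, threshold: float = 1.0) -> np.ndarray:
    """Encode floats in [0, 1] as T binary spikes each (hypothetical IF encoder).

    Returns an array of shape (T, *x.shape) holding 0s and 1s.
    """
    membrane = np.zeros_like(x, dtype=np.float64)
    spikes = np.zeros((T,) + x.shape, dtype=np.uint8)
    for t in range(T):
        membrane += x                   # integrate the constant input
        fired = membrane >= threshold   # fire where the threshold is reached
        spikes[t] = fired
        membrane -= fired * threshold   # soft reset: subtract the threshold
    return spikes

def per_step_hamming(Q: np.ndarray, K: np.ndarray, T: int) -> np.ndarray:
    """Hamming distances of shape (T, n, n) between binarized tokens."""
    Qb, Kb = if_encode(Q, T), if_encode(K, T)  # each of shape (T, n, d)
    # XOR every query token against every key token at each time step.
    xor = np.bitwise_xor(Qb[:, :, None, :], Kb[:, None, :, :])  # (T, n, n, d)
    return xor.sum(axis=-1, dtype=np.int64)

# Usage: two tokens with d = 2 features and T = 4 time steps.
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
dists = per_step_hamming(Q, Q, T=4)
# Assumption: reduce the per-step distances to one similarity score in [0, 1].
scores = 1.0 - dists.mean(axis=0) / Q.shape[-1]
```

The broadcasted XOR materializes a `(T, n, n, d)` array, which is convenient for a small example but would be replaced by packed-bit operations on real hardware.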
