LookupFFN: Making Transformers Compute-lite for CPU inference

Zhanpeng Zeng, Michael Davies, Pranav Pulijala, Karthikeyan Sankaralingam, Vikas Singh

TL;DR

This work proposes an alternative formulation to GEMM-based FFNs, inspired by recent studies that use Locality Sensitive Hashing to approximate FFNs. It recasts most essential operations as memory look-ups, leveraging the trade-off between the two resources available on any platform: compute and memory.

Abstract

While GPU clusters are the de facto choice for training large deep neural network (DNN) models today, several reasons, including ease of workflow, security, and cost, have led to efforts investigating whether CPUs may be viable for inference in routine use in many sectors of the industry. But the imbalance between the compute capabilities of GPUs and CPUs is huge. Motivated by these considerations, we study a module that is a workhorse within modern DNN architectures, GEMM-based Feed Forward Networks (FFNs), and assess the extent to which it can be made compute- (or FLOP-) lite. Specifically, we propose an alternative formulation (we call it LookupFFN) to GEMM-based FFNs, inspired by recent studies that use Locality Sensitive Hashing (LSH) to approximate FFNs. Our formulation recasts most essential operations as memory look-ups, leveraging the trade-off between the two resources on any platform: compute and memory (CPUs offer memory in abundance). For RoBERTa language model pretraining, our formulation achieves performance similar to GEMM-based FFNs while dramatically reducing the required FLOPs. Our development is complemented by detailed hardware profiling of strategies that will maximize efficiency, not just on contemporary hardware but on products that will be offered in the near- to medium-term future. Code is available at https://github.com/mlpen/LookupFFN.
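To make the compute-versus-memory trade-off concrete, here is a minimal sketch; it is not the authors' implementation. It contrasts a dense FFN, which needs a multiply-accumulate for every hidden unit, with a hash-then-look-up variant that selects one stored row per table. The sketch assumes sign-based random-projection hashing (one common LSH family) and ReLU as the activation; all names (`dense_ffn`, `hash_code`, `lookup_ffn`, `make_tables`) are hypothetical, and the table contents are random stand-ins for parameters that would normally be learned.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_ffn(x, W, V):
    """Dense FFN: sum_i relu(<x, W_i>) * V_i (ReLU is an assumption here)."""
    out = [0.0] * len(V[0])
    for w_i, v_i in zip(W, V):
        act = max(0.0, dot(x, w_i))
        for j, v in enumerate(v_i):
            out[j] += act * v
    return out

def hash_code(x, planes):
    """Sign random projection: one bit per hyperplane, packed into an int."""
    code = 0
    for p in planes:
        code = (code << 1) | (1 if dot(x, p) >= 0 else 0)
    return code

def lookup_ffn(x, tables):
    """Hash x once per table and sum the selected rows (memory reads, no GEMM)."""
    out = [0.0] * len(tables[0][1][0])
    for planes, rows in tables:
        for j, v in enumerate(rows[hash_code(x, planes)]):
            out[j] += v
    return out

def make_tables(rng, d_in, d_out, num_tables, bits):
    """Random hyperplanes and random rows; stand-ins for learned parameters."""
    tables = []
    for _ in range(num_tables):
        planes = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(bits)]
        rows = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(2 ** bits)]
        tables.append((planes, rows))
    return tables

rng = random.Random(0)
d, num_tables, bits = 8, 4, 3
tables = make_tables(rng, d, d, num_tables, bits)
x = [rng.gauss(0, 1) for _ in range(d)]
y = lookup_ffn(x, tables)
```

In this toy version, the hashing cost grows linearly with `bits` while each table holds `2 ** bits` rows, so capacity is bought mostly with memory rather than arithmetic.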

Paper Structure (11 sections, 19 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: The left plot shows the number of hash functions used versus the percentage of top-x nearest neighbors found using these hashes. A large number of hash functions is needed for accurate MIPS results, and the query time grows linearly with the number of hash functions. The right plot shows the size of each bucket, illustrating the bucket skewness issue.
  • Figure 2: High-level comparison of each method. The true $\mathcal{F}(\cdot) = \sum_{i=1}^h \sigma(\langle \cdot, [\mathbf{W}]_i \rangle) [\mathbf{V}]_i$ is constructed as a function on $S^2$. Here, $[\mathbf{W}]_i \in S^2$ and $[\mathbf{V}]_i \in \mathbb{R}$. The points $[\mathbf{W}]_i$ are marked in the left three panels. SLIDE, MONGOOSE, and YOSO try to construct an approximation of the true $\mathcal{F}(\cdot)$ via different uses of LSH partitions, so whenever $\mathcal{F}(\cdot)$ is updated, the LSH partitions need to be updated. Rather than approximating the function $\mathcal{F}$, our proposed method is plugged into a deep learning model and uses the back-propagated gradient to learn an appropriate transformation, similar to a vanilla FFN.
  • Figure 3: Illustration of LookupFFN operations.
  • Figure 4: Approximation capacity vs. FLOPs and parameters for each efficient projection. Hadamard denotes a variant of ACDC that replaces the discrete cosine transform with the Hadamard transform. The vertical dashed lines mark the FLOPs and parameters of the vanilla projection. Any results to the right of the vertical dashed lines are not meaningful, as there is no efficiency gain.
  • Figure 5: Visualization of efficient projections. ACDC and its Hadamard variant increase their capacity by increasing the depth. BH4 increases its capacity by increasing its block size.
  • ...and 1 more figure
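The bucket skewness mentioned in the Figure 1 caption can be reproduced on synthetic data. The following is a minimal sketch under stated assumptions: it uses sign-based random-projection hashing and artificially clustered points, neither of which is taken from the paper's experiments, and the names `sign_hash` and `bucket_sizes` are hypothetical.

```python
import random
from collections import Counter

def sign_hash(x, planes):
    """Pack the signs of the projections onto each hyperplane into an int."""
    code = 0
    for p in planes:
        proj = sum(a * b for a, b in zip(x, p))
        code = (code << 1) | (1 if proj >= 0 else 0)
    return code

def bucket_sizes(points, planes):
    """Count how many points land in each hash bucket."""
    return Counter(sign_hash(x, planes) for x in points)

rng = random.Random(0)
d, bits, n = 16, 4, 1000
planes = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(bits)]

# Synthetic assumption: points are small perturbations of one shared center,
# so most of them fall on the same side of each hyperplane.
center = [rng.gauss(0, 1) for _ in range(d)]
points = [[c + 0.3 * rng.gauss(0, 1) for c in center] for _ in range(n)]

sizes = bucket_sizes(points, planes)
largest = max(sizes.values())
print(f"{len(sizes)} of {2 ** bits} buckets used; largest holds {largest} of {n} points")
```

When the data is clustered like this, a few buckets absorb most points while others stay empty, which makes look-ups uneven in cost and wastes table capacity.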

Theorems & Definitions (2)

  • Remark 3.1
  • Remark 3.2