An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Mohsen Dehghankar, Mahdi Erfanian, Abolfazl Asudeh

TL;DR

This paper proposes algorithms to improve the inference time and memory efficiency of DNNs with binary and ternary weight matrices by focusing on matrix multiplication as the bottleneck operation of inference.

Abstract

Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make these models more accessible and cost-effective, in this paper, we propose algorithms to improve the inference time and memory efficiency of DNNs with binary and ternary weight matrices. Particularly focusing on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that help reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n\times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^2}{\log n})$, a logarithmic factor improvement over the standard vector-matrix multiplication. Besides theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of our approach with respect to both time and memory: we observed a reduction of up to 29x in multiplication time and up to 6x in memory usage. When applied to LLMs, our experiments show up to a 5.24x speedup in inference time.
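The preprocess-once, multiply-many idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `preprocess` and `multiply`, the block width parameter `k`, and the layout of the index are assumptions made for illustration. The sketch groups the rows of a binary matrix by their bit pattern within each block of `k` columns, stores only a row permutation and the group boundaries, and then answers $\vec{v} \cdot B$ from that index alone.

```python
def preprocess(B, k):
    """Build a per-block index for a binary matrix B (list of 0/1 rows).

    Hypothetical layout: for each block of up to k columns, keep the row
    permutation sorted by bit pattern and the start offset of each pattern.
    """
    n_rows, n_cols = len(B), len(B[0])
    index = []
    for c0 in range(0, n_cols, k):
        width = min(k, n_cols - c0)
        # Integer value of each row's bit pattern inside this block.
        keys = [int("".join(map(str, row[c0:c0 + width])), 2) for row in B]
        perm = sorted(range(n_rows), key=keys.__getitem__)
        counts = [0] * (2 ** width)
        for key in keys:
            counts[key] += 1
        # starts[p] is the position in perm where pattern p begins.
        starts = [0]
        for c in counts:
            starts.append(starts[-1] + c)
        index.append((width, perm, starts))
    return index


def multiply(v, index):
    """Compute v . B from the index only; B itself is not needed."""
    out = []
    for width, perm, starts in index:
        # One partial sum of v per bit pattern (rows sharing a pattern).
        seg = [sum(v[perm[i]] for i in range(starts[p], starts[p + 1]))
               for p in range(2 ** width)]
        # Column j of the block sums the groups whose pattern has bit j set.
        for j in range(width):
            shift = width - 1 - j
            out.append(sum(s for p, s in enumerate(seg) if (p >> shift) & 1))
    return out


def naive(v, B):
    """Standard vector-matrix product, used here only as a reference."""
    return [sum(v[r] * B[r][c] for r in range(len(B))) for c in range(len(B[0]))]


B = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 1]]
v = [2, -1, 3, 5]
result = multiply(v, preprocess(B, k=2))
print(result)  # [10, 2, 6]
```

In this sketch each block costs one pass over the vector plus work proportional to `k * 2**k`, so choosing `k` on the order of $\log n$ leaves roughly $n / \log n$ blocks, which is consistent with the logarithmic saving stated above.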

Paper Structure

This paper contains 46 sections, 6 theorems, 13 equations, 12 figures, 1 table, 3 algorithms.

Key Result

Proposition 2.1

Any ternary matrix $A$ can be expressed as $A = B^{(1)} - B^{(2)}$, where $B^{(1)}$ and $B^{(2)}$ are the following binary matrices:
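The definitions of $B^{(1)}$ and $B^{(2)}$ are not reproduced on this page. One natural choice that satisfies the identity, assumed here for illustration only, is to let $B^{(1)}$ mark the $+1$ entries of $A$ and $B^{(2)}$ mark the $-1$ entries. The helper name `split_ternary` is hypothetical.

```python
def split_ternary(A):
    """Return binary matrices (B1, B2) such that A = B1 - B2 entrywise.

    Assumed construction: B1 has a 1 where A is +1, B2 has a 1 where A is -1.
    """
    B1 = [[1 if a == 1 else 0 for a in row] for row in A]
    B2 = [[1 if a == -1 else 0 for a in row] for row in A]
    return B1, B2


A = [[1, 0, -1],
     [-1, 1, 0]]
B1, B2 = split_ternary(A)

# Rebuild A from the two binary matrices to check the identity.
rebuilt = [[b1 - b2 for b1, b2 in zip(r1, r2)] for r1, r2 in zip(B1, B2)]
print(rebuilt == A)  # True
```

Under this decomposition, a product with a ternary matrix reduces to two products with binary matrices followed by a subtraction, since $\vec{v} \cdot A = \vec{v} \cdot B^{(1)} - \vec{v} \cdot B^{(2)}$.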

Figures (12)

  • Figure 1: A visualization of the Redundant Segment Reduction (RSR) method for calculating $\vec{v} \cdot B$. In this example, $k = 2$.
  • Figure 2: The Full Segmentation of Example 3.3. There is no starting index for row $10$, so we skip it by reusing the start index of the next available value. The Full Segmentation list is the second row of the table.
  • Figure 3: Visualizing Step 2 of the RSR++ algorithm versus RSR at inference time.
  • Figure 4: Comparison of RSR, RSR++, and Standard in a native C++ implementation of Binary Matrix Multiplication. The speedup values compare RSR++ against Standard. Each value is the average of 10 different runs.
  • Figure 5: Memory consumption of RSR after preprocessing, compared to the memory required for the Standard matrix multiplication (NumPy). In RSR, we store only permutations and segmentation lists in memory.
  • ...and 7 more figures

Theorems & Definitions (13)

  • Proposition 2.1
  • Definition 3.1: $k$-Column Block
  • Definition 3.2: Binary Row Order
  • Example 3.3
  • Definition 3.4: Segmentation List
  • Proposition 3.5
  • Theorem 3.6
  • Definition 4.1: Segmented Sum
  • Lemma 4.2
  • Theorem 4.3
  • ...and 3 more