NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference

Ruiqi Sun; Siwei Ye; Jie Zhao; Xin He; Jianzhe Lin; Yiran Li; An Zou

TL;DR

NeuralMatrix tackles the core inefficiency of executing diverse DNN computations by converting entire networks into linear matrix operations that run on a GEMM accelerator. It achieves this through a pipeline that maps linear operations to GEMM, applies elastic approximations for nonlinearities (with horizontal size optimization and vertical bias correction), and employs approximation-aware training to preserve accuracy. Empirical results on ResNet and BERT/RoBERTa show minimal accuracy loss (often <0.5%) and substantial hardware efficiency gains, with up to 38.72× throughput-per-power improvements over CPUs, GPUs, and SoCs. This approach reduces hardware specialization needs while delivering high performance, enabling versatile DNNs to run efficiently on a single GEMM-based accelerator.
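To make the nonlinear step concrete, here is a minimal sketch of one way to approximate an activation so that only a multiply and an add remain per element. It assumes a piecewise-linear form with uniformly spaced segments over [-6, 6]. The summary above says only that NeuralMatrix uses "elastic approximations", so the segment layout, the choice of GELU as the example, and the names `build_segments` and `piecewise_linear` are illustrative assumptions, not the authors' implementation.

```python
import math
import numpy as np

def gelu(x):
    # Exact GELU via the error function; used here only as the reference.
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def build_segments(fn, lo, hi, num_segments):
    # Assumption: uniform breakpoints. Each segment stores a slope k and
    # an intercept b so that fn(x) is approximated by k * x + b.
    edges = np.linspace(lo, hi, num_segments + 1)
    y = fn(edges)
    k = (y[1:] - y[:-1]) / (edges[1:] - edges[:-1])
    b = y[:-1] - k * edges[:-1]
    return edges, k, b

def piecewise_linear(x, edges, k, b):
    # Look up the segment for each input, then apply one multiply-add.
    # Inputs outside [lo, hi] reuse the first or last segment.
    idx = np.searchsorted(edges, x, side="right") - 1
    idx = np.clip(idx, 0, len(k) - 1)
    return k[idx] * x + b[idx]

edges, k, b = build_segments(gelu, -6.0, 6.0, 64)
xs = np.linspace(-6.0, 6.0, 10001)
max_err = float(np.max(np.abs(piecewise_linear(xs, edges, k, b) - gelu(xs))))
print(f"max abs error with 64 segments: {max_err:.5f}")
```

Raising `num_segments` tightens the approximation at the cost of a larger slope/intercept table, which is presumably the kind of trade-off a size optimization step would tune.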

Abstract

The inherent diversity of computation types within deep neural network (DNN) models often requires a variety of specialized units in hardware processors. This limits computational efficiency and increases both inference latency and power consumption, especially when the hardware processor needs to support and execute different neural networks. In this study, we introduce NeuralMatrix, which elastically transforms the computations of entire DNNs into linear matrix operations. This transformation allows seamless execution of various DNN models using only matrix operations and paves the way for running versatile DNN models on a single General Matrix Multiplication (GEMM) accelerator. Extensive experiments with both CNN and transformer-based models demonstrate the potential of NeuralMatrix to accurately and efficiently execute a wide range of DNN models, achieving 2.17 to 38.72 times higher computation efficiency (i.e., throughput per power) compared to CPUs, GPUs, and SoC platforms. This level of efficiency is usually attainable only with an accelerator designed for a specific neural network.
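As an illustration of what expressing a linear layer as a matrix operation can look like, the sketch below lowers a 2D convolution to a single matrix multiplication using the standard im2col rearrangement. This is a well-known general technique, shown here under the assumptions of stride 1 and no padding. The abstract does not say which lowering NeuralMatrix uses, and the function names `im2col`, `conv2d_gemm`, and `conv2d_direct` are hypothetical.

```python
import numpy as np

def im2col(x, kh, kw):
    # x has shape (C, H, W). Assumption: stride 1, no padding.
    # Each row of the result holds one (channel, kernel-row, kernel-col)
    # tap, laid out over all output positions.
    c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[row] = x[ci, i:i + oh, j:j + ow].reshape(-1)
                row += 1
    return cols, oh, ow

def conv2d_gemm(x, weight):
    # weight has shape (F, C, kh, kw). The whole convolution becomes
    # one (F, C*kh*kw) @ (C*kh*kw, oh*ow) matrix product.
    f, _, kh, kw = weight.shape
    cols, oh, ow = im2col(x, kh, kw)
    return (weight.reshape(f, -1) @ cols).reshape(f, oh, ow)

def conv2d_direct(x, weight):
    # Reference sliding-window implementation, used only for comparison.
    f, _, kh, kw = weight.shape
    _, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    out = np.zeros((f, oh, ow), dtype=x.dtype)
    for fi in range(f):
        for i in range(oh):
            for j in range(ow):
                out[fi, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * weight[fi])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
w = rng.standard_normal((4, 3, 3, 3))
print(np.allclose(conv2d_gemm(x, w), conv2d_direct(x, w)))
```

The rearrangement duplicates input values across columns, trading extra memory for the ability to run the whole layer as one GEMM call.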

Paper Structure (20 sections, 11 figures, 1 table, 3 algorithms)

Introduction
Background and Related Work
Intensive / Versatile Computations in DNNs
General Matrix Multiplication Accelerator
NeuralMatrix -- Computing Networks with Matrix Operations
Mapping Linear Operations to General Matrix Operation
Elastic Approximation for Nonlinear Operations
Post-Training Approximation
Horizontal Size Optimization
Vertical Bias Correction
Approximation-Aware Training
Model Performance and Computation Cost
Inference Accuracy
CNN-based ResNet
Transformer-based BERT
...and 5 more sections

Figures (11)

Figure 1: NeuralMatrix translates neural network computation into matrix operations, enabling them on a GEMM accelerator.
Figure 2: Overview of NeuralMatrix: Different types of computation in DNN will go through different decision and process steps. An entire neural network can be moved to linear matrix operations and become fully executed by a GEMM accelerator.
Figure 3: Size Optimization
Figure 4: Bias Correction
Figure 6: Network accuracy with CNN-based ResNet.
...and 6 more figures
