NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference
Ruiqi Sun, Siwei Ye, Jie Zhao, Xin He, Jianzhe Lin, Yiran Li, An Zou
TL;DR
NeuralMatrix addresses the inefficiency of executing diverse DNN computation types by converting entire networks into linear matrix operations that run on a single GEMM accelerator. Its pipeline maps linear operations to GEMM, applies elastic approximations to nonlinear operations (with horizontal size optimization and vertical bias correction), and uses approximation-aware training to preserve accuracy. Experiments on ResNet and BERT/RoBERTa show minimal accuracy loss (often <0.5%) and substantial hardware efficiency gains, with up to 38.72× higher throughput per power than CPUs, GPUs, and SoCs. The approach reduces the need for specialized hardware units while retaining high performance, so that versatile DNNs can run efficiently on one GEMM-based accelerator.
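The idea of replacing a nonlinearity with operations of the form `y = k * x + b` can be illustrated with a small sketch. It assumes the approximation is piecewise linear over uniformly spaced segments, which the summary above does not spell out. The function names (`build_segments`, `approx`, `max_error`), the choice of GELU, and the input range are hypothetical and are not taken from the authors' implementation.

```python
import math
from bisect import bisect_right


def gelu(x: float) -> float:
    """Exact GELU, used here only as the reference nonlinearity."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_segments(fn, lo: float, hi: float, num_segments: int):
    """Return the breakpoints and one (slope, intercept) pair per segment.

    Assumption: uniform segments, each interpolating fn between its endpoints.
    """
    step = (hi - lo) / num_segments
    knots = [lo + i * step for i in range(num_segments + 1)]
    params = []
    for left, right in zip(knots[:-1], knots[1:]):
        slope = (fn(right) - fn(left)) / (right - left)
        intercept = fn(left) - slope * left
        params.append((slope, intercept))
    return knots, params


def approx(x: float, knots, params) -> float:
    """Evaluate the approximation: look up the segment, then compute k * x + b."""
    idx = bisect_right(knots, x) - 1
    # Inputs outside [lo, hi] reuse the nearest outer segment.
    idx = min(max(idx, 0), len(params) - 1)
    slope, intercept = params[idx]
    return slope * x + intercept


def max_error(fn, knots, params, samples: int = 2001) -> float:
    """Largest absolute error on a uniform grid over the approximated range."""
    lo, hi = knots[0], knots[-1]
    worst = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        worst = max(worst, abs(fn(x) - approx(x, knots, params)))
    return worst


if __name__ == "__main__":
    for n in (8, 64):
        knots, params = build_segments(gelu, -4.0, 4.0, n)
        print(n, "segments, max abs error:", max_error(gelu, knots, params))
```

In this sketch the number of segments is the only tuning knob: more segments lower the approximation error at the cost of a larger parameter table. Once the segment is selected, the remaining work is a multiply and an add, which is the kind of operation a GEMM-oriented datapath already provides.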
Abstract
The inherent diversity of computation types within deep neural network (DNN) models often requires a variety of specialized units in hardware processors. This limits computational efficiency and increases both inference latency and power consumption, especially when the processor must support and execute different neural networks. In this study, we introduce NeuralMatrix, which elastically transforms the computations of entire DNNs into linear matrix operations. This transformation allows various DNN models to be executed entirely with matrix operations and paves the way for running versatile DNN models on a single General Matrix Multiplication (GEMM) accelerator. Extensive experiments with both CNN and transformer-based models demonstrate the potential of NeuralMatrix to execute a wide range of DNN models accurately and efficiently, achieving 2.17 to 38.72 times the computation efficiency (i.e., throughput per power) of CPUs, GPUs, and SoC platforms. This level of efficiency is usually attainable only with an accelerator designed for a specific neural network.
