Fast, Scalable, Energy-Efficient Non-element-wise Matrix Multiplication on FPGA

Xuqi Zhu, Huaizhi Zhang, JunKyu Lee, Jiacheng Zhu, Chandrajit Pal, Sangeet Saha, Klaus D. McDonald-Maier, Xiaojun Zhai

TL;DR

The paper tackles the matrix-multiplication bottleneck in FPGA-based neural network inference by introducing the Approximate Multiplication Unit (AMU), a non-element-wise, LUT-based multiplier built on the MADDNESS approximation. It combines three hardware-friendly optimisations (I/O pruning, feature map reorganisation, and parameter compression) to decouple computational cost from input resolution and to optimise memory usage on the FPGA. Results on a Xilinx platform show up to 9× higher throughput and 112× higher energy efficiency than state-of-the-art FPGA QNN accelerators, at an acceptable loss in accuracy. The AMU architecture, comprising an Allocator, an Encoder, and an Aggregator, scales across problem sizes and network types, and offers a competitive path toward ASIC-like efficiency in FPGA implementations.

Abstract

Modern Neural Network (NN) architectures rely heavily on vast numbers of multiply-accumulate operations, which constitute their predominant computational cost. This paper therefore proposes a high-throughput, scalable, and energy-efficient non-element-wise matrix multiplication unit on FPGAs as a basic building block of NNs. We first streamline the inter-layer and intra-layer redundancies of the MADDNESS algorithm, a LUT-based approximate matrix multiplication method, to design a fast, efficient, and scalable approximate matrix multiplication module termed the "Approximate Multiplication Unit (AMU)". The AMU further optimizes LUT-based matrix multiplication through dedicated memory management and access design, decoupling computational overhead from input resolution and significantly boosting the efficiency of FPGA-based NN accelerators. Experimental results show that the AMU achieves up to 9x higher throughput and 112x higher energy efficiency than state-of-the-art FPGA-based Quantised Neural Network (QNN) accelerators.
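
The general idea behind LUT-based approximate matrix multiplication can be illustrated with a small sketch. It is not the AMU design or the MADDNESS implementation. It assumes a plain product-quantization scheme in which the encoder picks the nearest prototype by squared distance, whereas MADDNESS replaces that search with learned hash functions so that the online path avoids multiplications. All function names, the toy codebooks, and the weight values below are hypothetical.

```python
# Illustrative sketch only: nearest-prototype encoding stands in for the
# MADDNESS hashing step; names and data are hypothetical.

def split_subspaces(vec, num_codebooks):
    """Split a length-D vector into num_codebooks contiguous sub-vectors."""
    step = len(vec) // num_codebooks
    return [vec[c * step:(c + 1) * step] for c in range(num_codebooks)]

def encode(x, codebooks):
    """Map each sub-vector of x to the index of its nearest prototype."""
    codes = []
    for sub, protos in zip(split_subspaces(x, len(codebooks)), codebooks):
        dists = [sum((s - p) ** 2 for s, p in zip(sub, proto)) for proto in protos]
        codes.append(dists.index(min(dists)))
    return codes

def build_luts(weights, codebooks):
    """Offline step: luts[c][k][j] is the dot product of prototype k of
    codebook c with the matching row slice of weight column j."""
    num_outputs = len(weights[0])
    step = len(weights) // len(codebooks)
    luts = []
    for c, protos in enumerate(codebooks):
        rows = weights[c * step:(c + 1) * step]
        luts.append([[sum(p * row[j] for p, row in zip(proto, rows))
                      for j in range(num_outputs)] for proto in protos])
    return luts

def approx_matvec(x, codebooks, luts):
    """Online step: encode x, then sum one table entry per codebook."""
    codes = encode(x, codebooks)
    num_outputs = len(luts[0][0])
    return [sum(luts[c][k][j] for c, k in enumerate(codes))
            for j in range(num_outputs)]

def exact_matvec(x, weights):
    """Reference result: ordinary vector-matrix product x @ W."""
    return [sum(xd * row[j] for xd, row in zip(x, weights))
            for j in range(len(weights[0]))]

# Toy setup: D = 4 inputs, 2 codebooks of 2 prototypes each, 2 outputs.
codebooks = [[[0, 0], [4, 4]], [[1, 0], [0, 1]]]
weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
luts = build_luts(weights, codebooks)

print(approx_matvec([4, 4, 0, 1], codebooks, luts))  # [23, 32], matches exact
print(exact_matvec([3, 5, 1, 0], weights))           # [23, 32]
print(approx_matvec([3, 5, 1, 0], codebooks, luts))  # [21, 30], approximation error
```

In this sketch the online cost depends on the number of codebooks and outputs, since each output is a sum of one precomputed table entry per codebook. An input made up exactly of prototypes reproduces the exact product, while other inputs incur an approximation error.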

Paper Structure (27 sections, 6 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: (a) Fully connected layer, (b) MADDNESS offline training and (c) MADDNESS online multiplications
  • Figure 2: I/O pruning and parameter compression: inter-layer and intra-layer redundancies can be streamlined when multiple MADDNESS matrix-vector multiplication units are used sequentially
  • Figure 3: Overview of the proposed AMU architecture: (a) The I/O pruning and three components of the AMU: (b) Allocator; (c) Encoder; (d) Aggregator
  • Figure 4: AMU resource utilisation and throughput (II) analysis
  • Figure 5: The impact of prefix layer configurations on AMU-based MLP accuracy as depth increases, using the first layer setting as an example. "Exact" denotes exact matrix multiplication; the pair $(I, N)$ denotes $I$ codebooks and $N$ prototypes for the first hidden layer (the remaining layers use setting (4, 16))
  • ...and 4 more figures