bitSMM: A bit-Serial Matrix Multiplication Accelerator

Pedro Antunes, Artur Podobas

Abstract

Neural-network (NN) inference is increasingly performed on board spacecraft to reduce downlink bandwidth and enable timely decision making. However, the power and reliability constraints of space missions limit the applicability of many state-of-the-art NN accelerators. This paper presents bitSMM, a bit-serial matrix multiplication accelerator built around a systolic array of bit-serial multiply-accumulate (MAC) units. The design supports runtime-configurable operand precision from 1 to 16 bits, and we evaluate two MAC variants: a Booth-inspired architecture and a standard binary multiplication with correction (SBMwC) architecture. We implement bitSMM in [System]Verilog and evaluate it on an AMD ZCU104 FPGA and through ASIC physical implementation using the asap7 and nangate45 process design kits. On the FPGA, bitSMM achieves up to 19.2 GOPS and 2.973 GOPS/W, and in asap7 it achieves up to 73.22 GOPS, 552 GOPS/mm², and 40.8 GOPS/W.

Paper Structure (13 sections, 11 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Generic systolic array (SA).
  • Figure 2: Combinational logic of the Booth-based bit-serial MAC.
  • Figure 3: Combinational logic of the SBMwC-based bit-serial MAC.
  • Figure 4: Overall architecture of the bit-serial SA, including parallel-to-serial (P2S) converters, MAC grid, and data propagation registers.
  • Figure 5: Output readout network of the SA, showing the snake-like traversal of the read enable signal and multiplexed accumulator outputs.
  • ...and 1 more figure