FERMI-ML: A Flexible and Resource-Efficient Memory-In-Situ SRAM Macro for TinyML acceleration

Mukul Lokhande, Akash Sankhe, S. V. Jaya Chand, Santosh Kumar Vishvakarma

TL;DR

FERMI-ML addresses the energy and bandwidth costs of TinyML on AIoT devices with a flexible Memory-In-Situ (MIS) SRAM macro that performs computation inside the memory array. The design combines a 9T XNOR-based RX9T bit-cell with a 22T 4:2 compressor (C22T) to enable variable-precision MAC and CAM operations inside a 4 KB macro, supporting Normal, CAM, and PIM modes with Posit-4/FP-4 precision. Post-layout results at 65 nm show 350 MHz operation at 0.9 V, achieving 1.93 TOPS and 364 TOPS/W, with QoR exceeding 97.5% on InceptionV4 and ResNet-18. The work demonstrates a compact, reconfigurable MIS macro capable of mixed-precision TinyML workloads and LUT-based non-linear activations, with potential integration as an L3 cache in RISC-V edge AI SoCs.
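
As a rough illustration of the CAM mode and the LUT-based activations mentioned above, the following Python sketch models a content-addressable search as a row-wise XNOR match and then uses the matching row to select a precomputed activation value. It is a behavioural sketch only: the function names, the 4-bit two's-complement code, and the ReLU table are assumptions made for illustration, not the paper's Posit-4/FP-4 encoding or circuit.

```python
def cam_search(stored_words, key, width=4):
    """Return the rows whose stored word matches `key` bit-for-bit."""
    mask = (1 << width) - 1
    hits = []
    for row, word in enumerate(stored_words):
        agree = ~(word ^ key) & mask  # XNOR: 1 wherever the bits agree
        if agree == mask:             # all bits agree, so this row matches
            hits.append(row)
    return hits


def lut_activation(code, keys, values, width=4):
    """Look up an activation output by searching the key column for `code`."""
    hits = cam_search(keys, code, width)
    if not hits:
        raise KeyError(f"no CAM entry for code {code:#06b}")
    return values[hits[0]]


def to_signed(code, width=4):
    """Interpret a `width`-bit code as two's complement (assumed encoding)."""
    return code - (1 << width) if code >= 1 << (width - 1) else code


# Hypothetical table: one row per 4-bit input code, holding ReLU(input).
KEYS = list(range(16))
RELU_VALUES = [max(0, to_signed(k)) for k in KEYS]
```

With this table, `lut_activation(0b0011, KEYS, RELU_VALUES)` selects the row for +3, while `lut_activation(0b1101, KEYS, RELU_VALUES)` selects the row for -3, whose stored ReLU output is 0.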

Abstract

The growing demand for low-power and area-efficient TinyML inference on AIoT devices necessitates memory architectures that minimise data movement while sustaining high computational efficiency. This paper presents FERMI-ML, a Flexible and Resource-Efficient Memory-In-Situ (MIS) SRAM macro designed for TinyML acceleration. The proposed 9T XNOR-based RX9T bit-cell integrates a 5T storage cell with a 4T XNOR compute unit, enabling variable-precision MAC and CAM operations within the same array. A 22-transistor (C22T) compressor-tree-based accumulator facilitates logarithmic 1- to 64-bit MAC computation with reduced delay and power compared to conventional adder trees. The 4 KB macro provides dual functionality for in-situ computation and CAM-based lookup operations, supporting Posit-4 or FP-4 precision. Post-layout results at 65 nm show operation at 350 MHz and 0.9 V, delivering a throughput of 1.93 TOPS and an energy efficiency of 364 TOPS/W, while maintaining a Quality-of-Result (QoR) above 97.5% on InceptionV4 and ResNet-18. FERMI-ML thus demonstrates a compact, reconfigurable, and energy-aware digital Memory-In-Situ macro capable of supporting mixed-precision TinyML workloads.
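
The arithmetic behind an XNOR-based MAC and a 4:2 compressor can be sketched behaviourally in Python. The sketch assumes 1-bit operands encoding {-1, +1} (bit 1 for +1, bit 0 for -1) and models the compressor as two chained full adders, which is the textbook construction. It does not reproduce the transistor-level C22T design or the Posit-4/FP-4 datapath, and all function names are hypothetical.

```python
def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """Behavioural 4:2 compressor built from two full adders.

    Invariant: x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout).
    """
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


def compress_4_to_2(a, b, c, d, width):
    """Reduce four `width`-bit operands to two words whose sum is a+b+c+d."""
    s_vec, c_vec, cin = 0, 0, 0
    # One extra bit position absorbs the last horizontal carry (cout).
    for i in range(width + 1):
        bits = [(v >> i) & 1 for v in (a, b, c, d)]
        s, carry, cout = compressor_4_2(*bits, cin)
        s_vec |= s << i
        c_vec |= carry << (i + 1)  # carry has twice the weight of s
        cin = cout                 # cout feeds the neighbouring column
    return s_vec, c_vec


def xnor_popcount_dot(w_bits, a_bits):
    """Dot product of two {-1,+1} vectors stored as bits (1 -> +1, 0 -> -1)."""
    assert len(w_bits) == len(a_bits)
    matches = sum(1 - (w ^ a) for w, a in zip(w_bits, a_bits))  # XNOR, popcount
    return 2 * matches - len(w_bits)
```

For example, `xnor_popcount_dot([1, 0, 1, 1], [1, 1, 0, 1])` evaluates to 0, since two positions agree and two disagree. A single final adder on the two words returned by `compress_4_to_2` recovers the full sum, which is why compressor trees defer carry propagation to the last stage.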

Paper Structure

This paper contains 9 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Detailed circuitry showing (a) the proposed Memory-In-Situ SRAM bank architecture and (b) the novel resource-efficient compressor tree structure.
  • Figure 2: Schematic of the resource-efficient 9T XNOR SRAM bit-cell for Memory-In-Situ processing.
  • Figure 3: Schematic of the novel 22T 4:2 compressor for faster and more power-efficient accumulation.
  • Figure 4: Post-layout performance comparison with state-of-the-art Memory-In-Situ SRAM bit-cells [10], [11], [14], [15], [17]-[20], [32], in terms of (a) power and read delay, (b) area and write energy.
  • Figure 5: Impact of dynamic voltage-frequency scaling on the energy consumption of the proposed SRAM macro for (a) Memory-In-Situ XAC/MAC operation in matrix multiplication, (b) CAM operation for look-up tables.
  • ...and 1 more figures