Siracusa: A 16 nm Heterogenous RISC-V SoC for Extended Reality with At-MRAM Neural Engine

Arpan Suravi Prasad, Moritz Scherer, Francesco Conti, Davide Rossi, Alfio Di Mauro, Manuel Eggimann, Jorge Tómas Gómez, Ziyun Li, Syed Shakib Sarwar, Zhao Wang, Barbara De Salvo, Luca Benini

TL;DR

Siracusa is a 16 nm near-sensor heterogeneous SoC that tightly integrates the all-digital N-EUREKA neural engine with a high-density MRAM weight memory. The At-MRAM approach doubles weight-transfer bandwidth and enables all-weights-on-chip inference, delivering up to 1.95 TOp/s and 8.84 TOp/J under realistic XR workloads. Core contributions include a dual-memory neural subsystem (MRAM for weights, SRAM for tiles), software-assisted virtual memory paging, and a tile-activation memory. Together these reduce end-to-end latency by up to 1.7x and energy by up to 3x compared with conventional L3-based schemes. The results show state-of-the-art area efficiency (65.2 GOp/s/mm^2) and an end-to-end throughput of 698 GOp/s at 8-bit quantization, which is relevant for XR devices and wearable deployments.

Abstract

Extended reality (XR) applications are Machine Learning (ML)-intensive, featuring deep neural networks (DNNs) with millions of weights, tightly latency-bound (10-20 ms end-to-end), and power-constrained (low tens of mW average power). While ML performance and efficiency can be achieved by introducing neural engines within low-power systems-on-chip (SoCs), system-level power for nontrivial DNNs depends strongly on the energy of non-volatile memory (NVM) access for network weights. This work introduces Siracusa, a near-sensor heterogeneous SoC for next-generation XR devices manufactured in 16 nm CMOS. Siracusa couples an octa-core cluster of RISC-V digital signal processing cores with a novel tightly-coupled "At-Memory" integration between a state-of-the-art digital neural engine called N-EUREKA and an on-chip NVM based on magnetoresistive memory (MRAM), achieving 1.7x higher throughput and 3x better energy efficiency than XR SoCs using NVM as background memory. The fabricated SoC prototype achieves an area efficiency of 65.2 GOp/s/mm^2 and a peak energy efficiency of 8.84 TOp/J for DNN inference while supporting complex heterogeneous application workloads, which combine ML with conventional signal processing and control.

Paper Structure (23 sections, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Architectural overview of the Siracusa SoC, consisting of the IO-Domain (upper left) and a Heterogeneous Cluster that includes the bit-serial N-EUREKA hardware accelerator (upper right) and 8 RISC-V cores (lower right). N-EUREKA is tightly coupled to the neural memory subsystem, consisting of 4 MiB of SRAM and 4 MiB of proprietary STT-MRAM IP. The RISC-V cores and N-EUREKA share access to the L1 memory (middle right) through the heterogeneous interconnect, which consists of a shallow and a logarithmic branch synchronized by a conflict manager with programmable access priority.
  • Figure 2: Overview of the datapath architecture of N-EUREKA. The core of N-EUREKA's datapath consists of 36 , which receive input activations from dual input buffers and weights from a dedicated weight streamer. The L1 streamer feeds the input buffers and transfers outputs to the shared L1 memory. A detailed overview of the datapath is shown on the right. Each contains 32 columns, each containing nine bit-serial multipliers, an adder, and a shifter. Each column is connected to a dedicated accumulator used to store partial results.
  • Figure 3: Overview of the execution of a single layer on N-EUREKA. ① captures the tiling. ②, ③ and ④ show the execution of dense 3×3, 1×1, and depthwise 3×3 convolutions on N-EUREKA, respectively.
  • Figure 4: A) Detail of the integration of N-EUREKA with the MRAM Weight Memory Subsystem. B) Example of N-EUREKA execution, overlapping prefetching through L1 streamer and weight streaming through weight streamer, and detail of the weight streamer operation: ① two weight requests from N-EUREKA are propagated through the in two cycles and ② propagated to the ; ③,④ the responds to two requests on parallel banks with latency = 3 internal cycles; ⑤ responses are propagated back to N-EUREKA with a total of 9 cycles of latency. C) Overall architecture of the Neural Memory System with detail of paging mechanism.
  • Figure 5: Annotated micrograph of a 4 mm × 4 mm Siracusa die. The highlighted Cluster components include the RISC-V cores, L1 memory, instruction cache, N-EUREKA, and the weight and tile memories, occupying a total of 10.7 mm^2. The remaining highlighted components, including the peripheral controllers, the PLLs, and the L2 memory, occupy 4.3 mm^2.
  • ...and 6 more figures
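
The Figure 2 caption describes columns built from nine bit-serial multipliers, an adder, and a shifter, each feeding an accumulator. As a rough illustration of that general shift-and-add idea, the following Python sketch computes a nine-element dot product one weight bit-plane at a time. It is not the N-EUREKA implementation: the function name `bit_serial_dot`, the choice to serialize the weights rather than the activations, and the restriction to unsigned weights are all assumptions made for this example.

```python
def bit_serial_dot(activations, weights, weight_bits=8):
    """Dot product computed one weight bit-plane at a time.

    Hypothetical sketch: weights are assumed to be unsigned integers
    that fit in `weight_bits` bits; activations may be any integers.
    """
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have equal length")
    if any(not 0 <= w < (1 << weight_bits) for w in weights):
        raise ValueError("weights must be unsigned and fit in weight_bits")

    accumulator = 0
    for bit in range(weight_bits):
        # 1-bit "multipliers": each activation is gated by one weight bit,
        # and the adder sums the gated activations.
        partial = sum(a for a, w in zip(activations, weights) if (w >> bit) & 1)
        # Shifter: scale the partial sum by 2**bit before accumulating.
        accumulator += partial << bit
    return accumulator


if __name__ == "__main__":
    acts = [1, 2, 3, 4, 5, 6, 7, 8, 9]             # nine inputs, one per multiplier
    wts = [3, 0, 255, 7, 16, 1, 128, 64, 2]        # unsigned 8-bit weights
    result = bit_serial_dot(acts, wts)
    # The bit-serial result matches the ordinary dot product.
    print(result == sum(a * w for a, w in zip(acts, wts)))
```

In this formulation the number of loop iterations scales with the weight bit width, which is the usual reason bit-serial datapaths trade latency for precision flexibility; how N-EUREKA schedules its bit planes is not stated in this excerpt.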