A New Dataflow Implementation to Improve Energy Efficiency of Monolithic 3D Systolic Arrays

Prachi Shukla, Vasilis F. Pavlidis, Emre Salman, Ayse K. Coskun

TL;DR

This work addresses edge DNN latency and energy challenges by proposing WS-Mono3D, a weight-stationary (WS) dataflow implemented on a monolithic 3D (Mono3D) systolic array. By exploiting high-bandwidth monolithic inter-tier vias (MIVs) and multiple on-chip RRAM layers, WS-Mono3D multicasts inputs and pre-loads weights in parallel, which eliminates input and weight forwarding cycles. The approach, evaluated with cross-layer architecture, circuit, and thermal models, reports up to 47% lower latency, up to 40% better EDP, and up to 81% higher inferences per second per watt (I/S/W) than WS in 2D, with the gains sensitive to the thermal budget. These results demonstrate the practical potential of WS dataflow on Mono3D for energy-efficient edge inference, while underscoring the importance of thermal-aware design and future co-optimization of dataflows and DNNs.

Abstract

Systolic arrays are popular for executing deep neural networks (DNNs) at the edge. Low latency and energy efficiency are key requirements in edge devices such as drones and autonomous vehicles. Monolithic 3D (MONO3D) is an emerging 3D integration technique that offers ultra-high bandwidth among processing and memory elements with a negligible area overhead. Such high bandwidth can help meet the ever-growing latency and energy efficiency demands for DNNs. This paper presents a novel implementation for weight stationary (WS) dataflow in MONO3D systolic arrays, called WS-MONO3D. WS-MONO3D utilizes multiple resistive RAM layers and SRAM with high-density vertical interconnects to multicast inputs and perform high-bandwidth weight pre-loading while maintaining the same order of multiply-and-accumulate operations as in native WS dataflow. Consequently, WS-MONO3D eliminates input and weight forwarding cycles and, thus, provides up to 40% improvement in energy-delay-product (EDP) over the native WS implementation in 2D at iso-configuration. WS-MONO3D also provides 10X improvement in inference per second per watt per footprint due to multiple vertical tiers. Finally, we also show that temperature impacts the energy efficiency benefits in WS-MONO3D.
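The EDP metric quoted above is the product of the energy consumed and the time taken, so it rewards designs that cut latency without a proportional rise in power. The following is a minimal sketch of how such a comparison could be computed. The function names and all numeric values are hypothetical illustrations and are not taken from the paper's evaluation.

```python
def edp(power_w: float, latency_s: float) -> float:
    """Energy-delay product in joule-seconds: (P * t) * t."""
    energy_j = power_w * latency_s  # energy consumed by one inference
    return energy_j * latency_s


def edp_improvement(p_base: float, t_base: float,
                    p_new: float, t_new: float) -> float:
    """Fractional EDP reduction of a new design relative to a baseline."""
    return 1.0 - edp(p_new, t_new) / edp(p_base, t_base)


if __name__ == "__main__":
    # Hypothetical values (NOT from the paper): the new design draws
    # 10% more power but finishes an inference in 20% less time.
    gain = edp_improvement(p_base=1.0, t_base=10e-3,
                           p_new=1.1, t_new=8e-3)
    print(f"EDP improvement: {gain:.1%}")
```

Because delay enters the product twice, a latency reduction can outweigh a moderate power increase. This is also why a thermal constraint that caps frequency or power can change the achievable EDP benefit.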

Paper Structure (11 sections, 1 equation, 4 figures)

Figures (4)

  • Figure 1: A systolic array: 4$\times$4 PE array with on-chip SRAMs.
  • Figure 2: (a) A flip-chip 6-tier Mono3D chip stack with 4 RRAM tiers for storing weights. Each tier is 2.816$\times$2.816 mm$^2$, (b) Top view of (a)'s 2D counterpart.
  • Figure 3: Evaluation framework for WS-Mono3D
  • Figure 4: WS-Mono3D versus WS in 2D for several DNNs at three frequency levels. (a-b) show absolute inference latencies in ms. Latencies in WS-Mono3D are up to 47% lower. (c-d) show absolute power values in watts (W). (e) shows steady-state temperatures in WS-Mono3D, with dotted lines for two thermal constraints. (f) Up to 40% EDP benefits in WS-Mono3D w.r.t. WS in 2D. (g) Up to 73% improvement in I/p/s/area in WS-Mono3D w.r.t. WS in 2D. (h) Up to 10$\times$ improvement in I/p/s/footprint in WS-Mono3D w.r.t. WS in 2D.