DARE: An Irregularity-Tolerant Matrix Processing Unit with a Densifying ISA and Filtered Runahead Execution

Xin Yang, Xin Fan, Zengshi Wang, Jun Han

TL;DR

DARE addresses irregular memory access and compute inefficiencies in sparse DNN workloads on MPUs by jointly extending the ISA with a densification mechanism and introducing a lightweight, filtered runahead execution framework. It combines Gather Scatter Access for non-strided sparsity with a Runahead Issue Queue (RIQ) and a Vector Matrix Register (VMR) to densify sparse operations and improve PE utilization, while the Runahead Filter Unit (RFU) reduces prefetch redundancy via a dynamic latency-based classifier. Across SpMM and SDDMM workloads, DARE achieves up to 4.44× higher performance and up to 22.8× better energy efficiency than the baseline, with 3.91× lower hardware overhead than the prior NVR approach, and it remains robust across memory environments. The work presents a practical path for co-optimizing hardware and sparsity-aware algorithms on CPUs/MPUs with modest area and energy costs, enabling more efficient sparse DNN accelerators.

Abstract

Deep Neural Networks (DNNs) are widely applied across domains and have shown strong effectiveness. As DNN workloads increasingly run on CPUs, dedicated Matrix Processing Units (MPUs) and Matrix Instruction Set Architectures (ISAs) have been introduced. At the same time, sparsity techniques are widely adopted in algorithms to reduce computational cost. Despite these advances, insufficient hardware-algorithm co-optimization leads to suboptimal performance. On the memory side, sparse DNNs incur irregular access patterns that cause high cache miss rates. While runahead execution is a promising prefetching technique, its direct application to MPUs is often ineffective due to significant prefetch redundancy. On the compute side, stride constraints in current Matrix ISAs prevent the densification of multiple logically related sparse operations, resulting in poor utilization of MPU processing elements. To address these irregularities, we propose DARE, an irregularity-tolerant MPU with a Densifying ISA and filtered Runahead Execution. DARE extends the ISA to support densifying sparse operations and equips a lightweight runahead mechanism with filtering capability. Experimental results show that DARE improves performance by 1.04$\times$ to 4.44$\times$ and increases energy efficiency by 1.00$\times$ to 22.8$\times$ over the baseline, with 3.91$\times$ lower hardware overhead than NVR.

Paper Structure

This paper contains 27 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: (a) Runtime of sparse SDDMM normalized to that of dense GEMM on an AMX-like MPU. Oracle assumes a cache without misses. (b) Performance of an MPU with NVR, normalized to a baseline MPU without NVR. (c) Processing Element (PE) utilization in a systolic array under various workloads, defined as the ratio of active PEs to the total number of PEs during execution.
  • Figure 2: (a) An example computation flow of SDDMM. Only the non-zero positions require computation. (b) The challenges encountered by MPUs with the runahead technique on sparse DNN workloads: PE under-utilization (Section \ref{sec:isaConstrain}) and runahead prefetch redundancy (Section \ref{sec:re}). (c) DARE's solutions to these challenges: an ISA extension to support non-strided access (Section \ref{sec_isa}) and runahead execution with a runahead filter (Section \ref{sec_arch}).
  • Figure 3: (a) The cache miss rate, prefetch redundancy, and cache bandwidth occupancy of NVR on SDDMM. (b) The average memory access latency of the baseline and NVR.
  • Figure 4: (a) Overview of the DARE architecture. Blue blocks indicate components proposed by DARE. (b) Runahead Issue Queue (RIQ) with a sub-module named the Dependency Management Unit (DMU). (c) Vector Matrix Register (VMR) as an auxiliary register file to store base vector addresses. (d) Runahead Filter Unit (RFU) with a threshold-based classifier.
  • Figure 5: Performance normalized to baseline. DARE achieves 1.04$\times$ to 4.44$\times$ performance improvement on average compared to the baseline across various benchmarks. DARE is reported as the better of DARE-FRE and DARE-full.
  • ...and 4 more figures
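
As a rough illustration of the SDDMM computation described in the caption of Figure 2(a), the following minimal Python sketch evaluates a dense-dense product only at the non-zero positions of a sparsity mask. The function names `sddmm` and `nonzero_fraction`, the choice to scale each result by the mask value, and the toy inputs are all assumptions made for illustration; they are not taken from the paper and do not model DARE's ISA or hardware.

```python
def sddmm(mask, a, b):
    """Sampled dense-dense matrix multiply (illustrative sketch).

    Computes mask[i][j] * dot(a[i, :], b[:, j]) only where mask[i][j] != 0;
    all other output positions are left at zero and cost no multiply-adds.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if mask[i][j] != 0:  # skip positions that need no computation
                dot = sum(a[i][k] * b[k][j] for k in range(inner))
                out[i][j] = mask[i][j] * dot
    return out


def nonzero_fraction(mask):
    """Fraction of output positions that actually require computation."""
    total = sum(len(row) for row in mask)
    nonzero = sum(1 for row in mask for v in row if v != 0)
    return nonzero / total


# Hypothetical toy inputs: a 2x2 mask with two non-zero positions.
mask = [[1, 0], [0, 1]]
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result = sddmm(mask, a, b)
print(result)
print(nonzero_fraction(mask))
```

Because the non-zero positions are scattered, a hardware unit that maps output tiles directly onto its processing elements would leave many of them idle on such a mask, which is the under-utilization the figure captions above refer to.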