Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV

Qian Chen; Xiaofeng Yang; Shengli Lu

TL;DR

This work addresses the bottleneck of sparse triangular solve (SpTRSV) by introducing a hardware accelerator that employs a medium granularity dataflow to balance spatial locality and parallelism. A custom compiler maps the sparse DAG to a vector of coarse nodes and fine-edge computations, augmented by a partial sum caching mechanism and an intra-node edges computation reordering algorithm to boost data reuse and reduce bank conflicts. Experimental results on 245 SuiteSparse benchmarks show substantial performance and energy efficiency gains over CPUs, GPUs, and the DPU-v2 accelerator, demonstrating the practicality of the approach for large-scale SpTRSV-like workloads. The combination of VLIW-inspired CUs, software-managed memory, and targeted dataflow optimizations suggests a scalable path for accelerating irregular sparse computations in scientific and engineering applications.

Abstract

Sparse triangular solve (SpTRSV) is widely used in various domains. Numerous studies have been conducted using CPUs, GPUs, and specific hardware accelerators, where dataflows can be categorized into coarse and fine granularity. Coarse dataflows offer good spatial locality but suffer from low parallelism, while fine dataflows provide high parallelism but disrupt the spatial structure, leading to increased nodes and poor data reuse. This paper proposes a novel hardware accelerator for SpTRSV or SpTRSV-like DAGs. The accelerator implements a medium granularity dataflow through hardware-software codesign and achieves both excellent spatial locality and high parallelism. Additionally, a partial sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm of intra-node edges computation is developed to enhance data reuse. Experimental results on 245 benchmarks with node counts reaching up to 85,392 demonstrate that this work achieves average performance improvements of 7.0$\times$ (up to 27.8$\times$) over CPUs and 5.8$\times$ (up to 98.8$\times$) over GPUs. Compared to the state-of-the-art technique (DPU-v2), this work shows a 2.5$\times$ (up to 5.9$\times$) average performance improvement and 1.7$\times$ (up to 4.1$\times$) average energy efficiency enhancement.
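For readers unfamiliar with the kernel, the following is a minimal sketch of a sequential sparse lower-triangular solve $Lx = b$ over a CSR matrix. It is a plain software baseline, not the accelerator's dataflow. The function name `sptrsv_csr`, the argument names, and the small 4x4 example matrix are assumptions made for illustration.

```python
def sptrsv_csr(indptr, indices, data, b):
    """Solve L x = b by forward substitution.

    L is a lower-triangular matrix in CSR form (indptr, indices, data).
    Every row is assumed to store a non-zero diagonal entry.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):                      # one row = one unknown x[i]
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]              # diagonal entry L[i][i]
            else:
                acc -= data[k] * x[j]       # depends on an earlier x[j]
        x[i] = acc / diag
    return x


# Hypothetical 4x4 example: diagonal entries 1, off-diagonal non-zeros -1.
# Row 1 depends on row 0; row 3 depends on rows 1 and 2.
indptr = [0, 1, 3, 4, 7]
indices = [0, 0, 1, 2, 1, 2, 3]
data = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
b = [1.0, 1.0, 1.0, 1.0]

x = sptrsv_csr(indptr, indices, data, b)
print(x)  # [1.0, 2.0, 1.0, 4.0]
```

Read loosely in the abstract's terms, treating each iteration of the outer loop as one unit of work corresponds to a coarse granularity, while treating each multiply-accumulate in the inner loop as its own unit corresponds to a fine granularity.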

Paper Structure (22 sections, 3 equations, 12 figures, 4 tables, 2 algorithms)

Introduction
Background
Sparse Triangular Solve on CPUs and GPUs
Very-Long-Instruction-Word Architecture
DAG Processing Unit and This Work
System Architecture
Compiler Overview
Accelerator Architecture
Custom Compiler Methodology
Medium Granularity Dataflow
Partial Sum Caching Mechanism
Intra-node Edges Computation Reordering Algorithm
Compiler Performance Analysis
Experiments
Experimental Setup
...and 7 more sections

Figures (12)

Figure 1: Three formats of a sparse triangular matrix. Assuming the diagonal entries are 1 and all other non-zeros are -1, the CSR format represents the first five rows of the matrix in (a). $n$ is the matrix order and $nnz$ is the number of non-zeros. (c) also illustrates the level-scheduling method.
Figure 2: Architecture comparison of DPU-v2 and this work.
Figure 3: An example of converting a coarse node 8 into multiple fine nodes and mapping them to the tree-shaped PEs, assuming other nodes have been solved. $L_{ij}$ represents the non-zeros in row $i$ and column $j$ of the coefficient matrix.
Figure 4: (a) Overview of the custom compiler. The spin arrow indicates updating the node with the RHS $b$. (b) The architecture of the proposed accelerator. It consists of $2^N$ compute units (CUs) connected by input and output interconnects, where $N$ is a hyperparameter. Stream memory stores the sparse matrix non-zeros ($L$) and RHS ($b$), sequentially supplying the data to CUs. Data memory stores the solution vector $x$.
Figure 5: (a) The structure and length of the instruction for each CU, assuming that there are $2^N$ CUs, that each CU has $2^M$ and $2^K$ words in its $x_i$ and $psum$ register files respectively, and that each CU has an addressing depth of $2^T$ for the data memory. (b) The encoding details of control signals. The definitions of relevant symbols are given in the paper's symbol table. (c) Automatic write address generation for the register file and data memory.
...and 7 more figures
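
The level-scheduling method mentioned in the Figure 1 caption can be sketched as follows. This is a generic illustration of the idea (rows with no unresolved dependencies share a level and can be solved in parallel), not the paper's compiler. The function name `level_schedule` and the 4x4 CSR example are assumptions made for illustration.

```python
def level_schedule(indptr, indices):
    """Group the rows of a lower-triangular CSR matrix into dependency levels.

    Row i depends on row j whenever L[i][j] is a non-zero with j < i.
    Rows in the same level have no dependencies on each other.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):                      # rows are visited in solve order
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j != i:                      # skip the diagonal entry
                level[i] = max(level[i], level[j] + 1)

    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]


# Hypothetical 4x4 pattern: row 1 depends on row 0; row 3 depends on rows 1 and 2.
indptr = [0, 1, 3, 4, 7]
indices = [0, 0, 1, 2, 1, 2, 3]

print(level_schedule(indptr, indices))  # [[0, 2], [1], [3]]
```

Here rows 0 and 2 can be solved together, then row 1, then row 3. Only the sparsity pattern matters for the schedule, so the non-zero values are not needed.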
