SparseMap: Loop Mapping for Sparse CNNs on Streaming Coarse-grained Reconfigurable Array

Xiaobing Ni, Mengke Ge, Jiaheng Ruan, Song Chen, Yi Kang

TL;DR

This work addresses the throughput degradation that streaming CGRAs suffer when accelerating sparse CNNs, where irregular input data demands cause excessive caching operations (COPs) and multi-cycle internal dependencies (MCIDs). It introduces SparseMap, a mapping algorithm that combines efficient I/O data management with operation scheduling and binding, and employs three key techniques: association-oriented input bus allocation, crossbar-based multi-casting, and reconstruction of internal adder dependencies. Through MIS-based binding on a conflict graph and pre-allocation of routing, SparseMap substantially reduces COPs and MCIDs while maintaining or improving the initiation interval $II$, with reported COP reductions of up to 92.5%, MCID reductions of 46%, and speedups of 1.5–2.67× over baselines. The approach enables higher-throughput sparse CNN acceleration on streaming CGRAs despite the irregular data patterns inherent to sparse networks.
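To make the idea of binding via an independent set on a conflict graph more concrete, here is a minimal sketch. It is not SparseMap's actual algorithm: this summary does not give the paper's graph construction or solver. The sketch assumes that each vertex is a hypothetical candidate binding `(operation, PE)`, that two candidates conflict when they share an operation or a PE, and that a simple greedy minimum-degree heuristic stands in for whatever MIS procedure the authors use. The names `build_conflicts` and `greedy_independent_set` are illustrative only.

```python
def build_conflicts(candidates):
    """Build an undirected conflict graph over candidate (op, pe) bindings.

    Assumption for illustration: two candidates conflict if they bind the
    same operation twice or place two operations on the same PE.
    """
    conflicts = {c: set() for c in candidates}
    for a in candidates:
        for b in candidates:
            if a != b and (a[0] == b[0] or a[1] == b[1]):
                conflicts[a].add(b)
    return conflicts


def greedy_independent_set(conflicts):
    """Return a maximal independent set using a min-degree greedy heuristic."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {v: set(nbrs) for v, nbrs in conflicts.items()}
    chosen = []
    while remaining:
        # Pick the vertex with the fewest remaining conflicts (ties by name).
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        chosen.append(v)
        # Drop the chosen vertex and every vertex that conflicts with it.
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return chosen


# Hypothetical example: two multiplications and two PEs in one time slot.
candidates = [("m0", "pe0"), ("m0", "pe1"), ("m1", "pe0"), ("m1", "pe1")]
graph = build_conflicts(candidates)
binding = greedy_independent_set(graph)
print(binding)
```

Each selected vertex is one conflict-free placement, so a larger independent set means more operations bound at once. A greedy heuristic only guarantees a maximal set, not a maximum one.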

Abstract

Streaming coarse-grained reconfigurable array (CGRA) is a promising architecture for data/computing-intensive applications because of its flexibility, high throughput, and efficient memory system. However, when accelerating sparse CNNs, the irregular input data demands inside sparse CNNs cause excessive caching operations (COPs) and multi-cycle internal dependencies (MCIDs) between operations, degrading the throughput of the streaming CGRA. We propose SparseMap, a mapping method for sparse CNNs onto streaming CGRA, which incorporates efficient I/O data management along with operation scheduling and binding to reduce the COPs and MCIDs, thereby ensuring the optimal throughput of the streaming CGRA. The experimental results show that SparseMap reduces COPs by 92.5% and MCIDs by 46.0% while achieving the same or even smaller initiation interval (II) compared to previous works.

Paper Structure

This paper contains 16 sections, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Streaming CGRA.
  • Figure 2: SparseMap overview.
  • Figure 3: (a) An s-DFG computing 4 channels from 4 kernels; (b) Input associations; (c) A scheduling with 3 MCIDs; (d) A scheduling with 1 MCID.
  • Figure 4: (a) The input data $c_0$ with 5 multiplications exceeding the fan-out PEs on an input bus in a 4$\times$4 PEA; (b) A COP $c$ inserted into the s-DFG; (c), (d) Multi-casting $c_0$ to 2 input buses by crossbar.
  • Figure 5: (a) A kernel with 4 multiplications and 1 adder tree composed of 3 additions; (b) A scheduling with 3 MCIDs for the fixed adder tree; (c) A scheduling with 1 MCID for the reconstructed adder tree; (d) The final kernel after reconstructing the internal dependencies within the adder tree.
  • ...and 1 more figure