Fused-Tiled Layers: Minimizing Data Movement on RISC-V SoCs with Software-Managed Caches

Victor J. B. Jung, Alessio Burrello, Francesco Conti, Luca Benini

TL;DR

This work tackles data movement bottlenecks in RISC-V SoCs with multi-level memory when executing DNNs by introducing Fused-Tiled Layers (FTL), an automatic fusion approach for tiled layers. By formulating layer tiling as a constraint-optimization problem and fusing consecutive layers, FTL minimizes transfers between memory levels and on-chip data movement. Integrated into the Deeploy deployment framework and tested on an extended RV32IMCF-XpulpV2 platform, FTL achieves up to 60.1% runtime reduction (and 47.1% reduction in memory transfers) in a ViT MLP stage. The results demonstrate significant practical benefits for edge DNN inference on RISC-V, highlighting the value of hardware-aware software fusion to reduce off-chip traffic and improve energy efficiency.

Abstract

The success of DNNs and their high computational requirements have pushed for large codesign efforts aimed at DNN acceleration. Since DNNs can be represented as static computational graphs, static memory allocation and tiling are two crucial optimizations. Hence, SoCs specialized for DNN acceleration commonly feature a multi-level software-managed memory hierarchy. In such architectures, layer-wise tiling, i.e., splitting each layer into multiple sub-nodes, is commonly used; however, while it reduces memory occupation, it can increase the total memory transfer, ultimately causing costly off-chip memory copies, which hurt energy efficiency and create memory bottlenecks. This work proposes Fused-Tiled Layers (FTL), a novel algorithm for automatic fusion between tiled layers. We leverage the flexibility and efficiency of a RISC-V (RV32) heterogeneous SoC to integrate FTL in an open-source deployment framework, which we tune for RISC-V targets. We demonstrate that FTL brings up to 60.1% runtime reduction for a typical MLP stage of ViT, owing to a 47.1% reduction in off-chip transfers and on-chip data movement.
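To make the transfer argument concrete, the following is a minimal sketch of a toy traffic model, not the FTL algorithm or the Deeploy implementation. It assumes a matrix multiplication of shape (M x K) by (K x N) followed by an element-wise activation, tiled along M only, with weights re-fetched for every tile and the activation applied in place. The function names, the footprint formula, and the brute-force tile search are all hypothetical stand-ins for the constraint-optimization formulation the abstract describes.

```python
from math import ceil


def tile_footprint(tile_m: int, k: int, n: int) -> int:
    """Elements resident in local memory for one tile (toy assumption):
    input tile + full weight matrix + output tile (activation in place)."""
    return tile_m * k + k * n + tile_m * n


def best_tile_m(m: int, k: int, n: int, l1_capacity: int) -> int:
    """Brute-force stand-in for a constraint solver: pick the largest
    tile height whose footprint fits in the local memory budget."""
    for tile_m in range(m, 0, -1):
        if tile_footprint(tile_m, k, n) <= l1_capacity:
            return tile_m
    raise ValueError("no tile size fits in the given capacity")


def matmul_traffic(m: int, k: int, n: int, tile_m: int) -> int:
    """Elements moved for the tiled matmul alone: read the input once,
    re-read the weights per tile, write the output once."""
    tiles = ceil(m / tile_m)
    return m * k + tiles * k * n + m * n


def layerwise_traffic(m: int, k: int, n: int, tile_m: int) -> int:
    """Layer-per-layer tiling: the intermediate tensor is written out
    by the matmul, then read back and rewritten by the activation."""
    activation = 2 * m * n  # read intermediate + write result
    return matmul_traffic(m, k, n, tile_m) + activation


def fused_traffic(m: int, k: int, n: int, tile_m: int) -> int:
    """Fused tiles: the activation runs on the tile while it is still
    in local memory, so the intermediate never leaves it."""
    return matmul_traffic(m, k, n, tile_m)


if __name__ == "__main__":
    M, K, N, CAPACITY = 8, 4, 4, 48  # illustrative sizes, in elements
    t = best_tile_m(M, K, N, CAPACITY)
    base = layerwise_traffic(M, K, N, t)
    fused = fused_traffic(M, K, N, t)
    print(f"tile_m={t} layerwise={base} fused={fused} "
          f"saved={100 * (base - fused) / base:.1f}%")
```

In this model the saving comes entirely from never writing out and re-reading the intermediate tensor. The percentages it prints are properties of the toy sizes chosen here and are not comparable to the measured results reported in the paper.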

Paper Structure

This paper contains 5 sections, 3 figures.

Figures (3)

  • Figure 1: Overview of FTL on a GEMM and GeLU layer.
  • Figure 2: Overview of the modified Siracusa SoC.
  • Figure 3: Runtime comparison of the ViT MLP stage using layer-per-layer tiling (baseline) and FTL on the Siracusa SoC.