RISC-V Based TinyML Accelerator for Depthwise Separable Convolutions in Edge AI

Muhammed Yildirim, Ozcan Ozturk

TL;DR

The paper tackles the memory-wall bottleneck in edge CNNs built on depthwise separable convolutions (DSC) by proposing a zero-buffer, fused pixel-wise dataflow that streams data directly through the MobileNetV2 DSC block (Expansion→Depthwise→Projection) without intermediate feature-map storage. Implemented as a RISC-V Custom Function Unit (CFU) within the CFU-Playground framework, the design achieves end-to-end fusion via a three-engine pipeline, on-the-fly padding, and a heterogeneous on-chip memory system. FPGA evaluation shows up to 59.3× speedup over a software baseline on the same RISC-V core, while ASIC synthesis at 40 nm and 28 nm projects compact area and sub-watt power (233 mW at 300 MHz and 910 mW at 2 GHz, respectively), suitable for TinyML. Compared to conventional layer-by-layer execution, the fused dataflow reduces data movement by up to 87% without requiring large buffers, illustrating a practical path to high-efficiency edge AI accelerators.

Abstract

The increasing demand for on-device intelligence in Edge AI and TinyML applications requires the efficient execution of modern Convolutional Neural Networks (CNNs). While lightweight architectures like MobileNetV2 employ Depthwise Separable Convolutions (DSC) to reduce computational complexity, their multi-stage design introduces a critical performance bottleneck inherent to layer-by-layer execution: the high energy and latency cost of transferring intermediate feature maps to either large on-chip buffers or off-chip DRAM. To address this memory wall, this paper introduces a novel hardware accelerator architecture that utilizes a fused pixel-wise dataflow. Implemented as a Custom Function Unit (CFU) for a RISC-V processor, our architecture eliminates the need for intermediate buffers entirely, reducing data movement by up to 87% compared to conventional layer-by-layer execution. It computes a single output pixel to completion across all DSC stages (expansion, depthwise convolution, and projection) by streaming data through a tightly coupled pipeline without writing intermediate results to memory. Evaluated on a Xilinx Artix-7 FPGA, our design achieves a speedup of up to 59.3× over the baseline software execution on the RISC-V core. Furthermore, ASIC synthesis projects a compact 0.284 mm$^2$ footprint with 910 mW power at 2 GHz in 28 nm, and a 1.20 mm$^2$ footprint with 233 mW power at 300 MHz in 40 nm. This work confirms the feasibility of a zero-buffer dataflow within a TinyML resource envelope, offering a novel and effective strategy for overcoming the memory wall in edge AI accelerators.
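To make the fused pixel-wise dataflow concrete, the following is a minimal software sketch, not the authors' hardware design. It assumes stride 1, a 3×3 depthwise kernel with zero padding, integer weights, ReLU6 after the expansion and depthwise stages, and no biases or quantization; all function names are hypothetical. The fused version finishes each output pixel through all three stages while holding only a 3×3 window of expanded values, whereas the layer-by-layer reference materializes two full intermediate feature maps.

```python
def relu6(v):
    """Clamp a value to the range [0, 6]."""
    return max(0, min(6, v))


def expand_pixel(ifmap, w_exp, y, x):
    """1x1 expansion at one input position.

    Positions outside the image yield zeros, which models padding
    generated on the fly instead of being stored in memory.
    """
    h, w = len(ifmap), len(ifmap[0])
    if not (0 <= y < h and 0 <= x < w):
        return [0] * len(w_exp)
    pixel = ifmap[y][x]
    return [relu6(sum(wt * v for wt, v in zip(row, pixel))) for row in w_exp]


def depthwise_pixel(window, w_dw):
    """3x3 depthwise convolution for one pixel; window[ky][kx] is a channel vector."""
    return [
        relu6(sum(w_dw[c][ky][kx] * window[ky][kx][c]
                  for ky in range(3) for kx in range(3)))
        for c in range(len(w_dw))
    ]


def project_pixel(vec, w_proj):
    """1x1 linear projection of one expanded-channel vector."""
    return [sum(wt * v for wt, v in zip(row, vec)) for row in w_proj]


def fused_block(ifmap, w_exp, w_dw, w_proj):
    """Compute every output pixel to completion; no full intermediate map is stored."""
    h, w = len(ifmap), len(ifmap[0])
    out = []
    for y in range(h):
        out_row = []
        for x in range(w):
            # Only a 3x3 window of expanded vectors is live at any time.
            window = [[expand_pixel(ifmap, w_exp, y + ky - 1, x + kx - 1)
                       for kx in range(3)] for ky in range(3)]
            out_row.append(project_pixel(depthwise_pixel(window, w_dw), w_proj))
        out.append(out_row)
    return out


def layer_by_layer_block(ifmap, w_exp, w_dw, w_proj):
    """Reference: run each stage over the whole image, storing both intermediates."""
    h, w = len(ifmap), len(ifmap[0])
    zero = [0] * len(w_exp)
    # Intermediate feature map 1: the full expanded tensor.
    expanded = [[expand_pixel(ifmap, w_exp, y, x) for x in range(w)] for y in range(h)]

    def at(y, x):
        return expanded[y][x] if 0 <= y < h and 0 <= x < w else zero

    # Intermediate feature map 2: the full depthwise output.
    dw_map = [[depthwise_pixel([[at(y + ky - 1, x + kx - 1) for kx in range(3)]
                                for ky in range(3)], w_dw)
               for x in range(w)] for y in range(h)]
    return [[project_pixel(dw_map[y][x], w_proj) for x in range(w)] for y in range(h)]
```

Both functions produce identical outputs for the same inputs; they differ only in what must be stored between stages. This naive sketch recomputes the expansion for every window position, trading extra arithmetic for storage. How the actual accelerator schedules or reuses that work is not covered by the abstract.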

Paper Structure

This paper contains 18 sections, 4 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: MobileNetV2 - residual block: Comparison of standard convolution (top) and depthwise separable convolution (bottom). DSC factorizes the operation into depthwise and pointwise stages, significantly reducing computation and parameter count.
  • Figure 2: CPU-CFU interface using R-type instruction of RISC-V ISA in CFU-Playground framework.
  • Figure 3: Comparison of DSC accelerator architectures. (a) The Unified Architecture, which uses a single engine for all stages, creating a significant off-chip memory traffic bottleneck. (b) The Separated Architecture, which reduces off-chip traffic but introduces a new bottleneck with the large on-chip buffer required for inter-engine communication. (c) Our Proposed Fused Architecture, which eliminates the intermediate buffer entirely by streaming data directly between engines, solving both prior bottlenecks.
  • Figure 4:
  • Figure 5: High-level block diagram of the proposed fused DSC accelerator. The architecture features a three-stage pipeline with dedicated parallel engines for the Expansion, Depthwise, and Projection stages, orchestrated by an Instruction Controller. Specialized memory structures, such as the parallel IFMAP and Dw-Weights Buffers, are designed to maximize data throughput. The legend details the color-coded paths for control, weights, IFMAP, and intermediate data flowing through the system.
  • ...and 9 more figures
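
The R-type CPU-CFU interface named in the Figure 2 caption can be illustrated by packing the standard RISC-V R-type fields into a 32-bit instruction word. This is a sketch of the generic R-type layout from the RISC-V ISA, under the assumption that the CFU is reached through the `custom-0` major opcode; the helper names are hypothetical, and the specific `funct7`/`funct3` values this accelerator assigns to its operations are not given here.

```python
CUSTOM_0 = 0b0001011  # RISC-V custom-0 major opcode (assumed to select the CFU)


def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode=CUSTOM_0):
    """Pack R-type fields into a 32-bit word: funct7|rs2|rs1|funct3|rd|opcode."""
    fields = ((funct7, 7), (rs2, 5), (rs1, 5), (funct3, 3), (rd, 5), (opcode, 7))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_r_type(word):
    """Split a 32-bit R-type instruction word back into its fields."""
    return {
        "funct7": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "funct3": (word >> 12) & 0x07,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }
```

In this layout, `rs1` and `rs2` name the two source registers whose values are handed to the CFU, `rd` names the register that receives its result, and the ten bits of `funct7` and `funct3` are free to select which CFU operation runs.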