MultiVic: A Time-Predictable RISC-V Multi-Core Processor Optimized for Neural Network Inference

Maximilian Kirschner, Konstantin Dudzik, Ben Krusekamp, Jürgen Becker

TL;DR

The paper tackles the need for high-performance neural network inference in real-time systems with strict timing guarantees. It introduces a parameterized, time-predictable multi-core vector processor where each worker core has private scratchpads and a central management core controls data transfers to DRAM via a static schedule, ensuring predictable timing. Across multiple configurations, the study finds that many smaller cores can outperform a single large vector core by increasing effective memory bandwidth and clock frequency, while maintaining very low execution-time variability; however, scalability is limited by FPGA routing beyond about 16 cores. The work provides an open-source FPGA implementation and demonstrates that time-predictable, multi-core architectures can deliver higher performance without sacrificing real-time determinism, paving the way for reliable NN inference in safety-critical applications.

Abstract

Real-time systems, particularly those used in domains like automated driving, are increasingly adopting neural networks. From this trend arises the need for high-performance hardware exhibiting predictable timing behavior. While state-of-the-art real-time hardware often suffers from limited memory and compute resources, modern AI accelerators typically lack the crucial predictability due to memory interference. We present a new hardware architecture to bridge this gap between performance and predictability. The architecture features a multi-core vector processor with predictable cores, each equipped with local scratchpad memories. A central management core orchestrates access to shared external memory following a statically determined schedule. To evaluate the proposed hardware architecture, we analyze different variants of our parameterized design. We compare these variants to a baseline architecture consisting of a single-core vector processor with large vector registers. We find that configurations with a larger number of smaller cores achieve better performance due to increased effective memory bandwidth and higher clock frequencies. Crucially for real-time systems, execution time fluctuation remains very low, demonstrating the platform's time predictability.
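The claim that more effective memory bandwidth raises performance can be illustrated with the standard roofline bound, where attainable throughput is the minimum of the compute peak and bandwidth times arithmetic intensity. The sketch below is only an illustration: the function name, the two configurations, and every number in it are hypothetical and are not taken from the paper's measurements.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)


# Hypothetical configurations (illustrative values, NOT results from the paper).
configs = {
    "single large core": {"peak_gflops": 16.0, "bandwidth_gbs": 2.0},
    "many small cores": {"peak_gflops": 16.0, "bandwidth_gbs": 4.0},
}

# Assumed arithmetic intensity of the kernel, in FLOP per byte moved.
intensity = 4.0

bounds = {
    name: attainable_gflops(c["peak_gflops"], c["bandwidth_gbs"], intensity)
    for name, c in configs.items()
}

for name, bound in bounds.items():
    print(f"{name}: at most {bound} GFLOP/s at {intensity} FLOP/byte")
```

With equal compute peaks, the configuration with twice the effective bandwidth has twice the bound while the kernel stays memory-bound. Once the memory roof exceeds the compute peak, only a higher peak (for example, from a higher clock frequency) raises the bound further.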

Paper Structure

This paper contains 13 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The proposed multi-core processor architecture, here with four cores
  • Figure 2: Baseline architecture with a single core and similar peripherals
  • Figure 3: Roofline plot comparing theoretical performance
  • Figure 4: Median execution time and standard deviation on the matmul benchmark
  • Figure 5: FPGA resource consumption on VCU128