ALADIN: Accuracy-Latency-Aware Design-space Inference Analysis for Embedded AI Accelerators

T. Baldi, D. Casini, A. Biondi

TL;DR

Experimental results highlight how architectural decisions and mixed-precision quantization strategies impact accuracy, latency, and resource usage, and show that these effects can be precisely evaluated and compared using ALADIN, while also revealing subtle optimization tensions.

Abstract

The inference of deep neural networks (DNNs) on resource-constrained embedded systems introduces non-trivial trade-offs among model accuracy, computational latency, and hardware limitations, particularly when real-time constraints must be satisfied. This paper presents ALADIN, an accuracy-latency-aware design-space inference analysis framework for mixed-precision quantized neural networks (QNNs) targeting scratchpad-based AI accelerators. ALADIN enables the evaluation and analysis of inference bottlenecks and design trade-offs across accuracy, latency, and resource consumption without requiring deployment on the target platform, thereby significantly reducing development time and cost. The framework introduces a progressive refinement process that transforms a canonical QONNX model into platform-aware representations by integrating both platform-independent implementation details and hardware-specific characteristics. ALADIN is validated using a cycle-accurate simulator of a RISC-V based platform specialized for AI workloads, demonstrating its effectiveness as a tool for quantitative inference analysis and hardware-software co-design. Experimental results highlight how architectural decisions and mixed-precision quantization strategies impact accuracy, latency, and resource usage, and show that these effects can be precisely evaluated and compared using ALADIN, while also revealing subtle optimization tensions.

Paper Structure

This paper contains 21 sections, 2 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: Simplified diagram of a platform with a controller and a parallel cluster with 8 cores and 16 memory banks.
  • Figure 2: Example of a DAG representation of a simple CNN consisting of a 2D convolutional layer and a fully connected layer, with a mixed-precision quantization configuration. Circles denote the operation nodes in the DAG, while squares label the edges, representing data dependencies.
  • Figure 3: Design inference analysis workflow.
  • Figure 4: Comparison of the standard 2D convolution (top) and its equivalent implementation via im2col transformation (bottom).
  • Figure 5: Each plot represents a different metric under analysis: (a) compares the network layer-wise in terms of MACs (irrelevant nodes are excluded); (b) shows the memory footprint of each layer; and (c) reports the complexity of each layer in terms of BOPs. ReLU layers are omitted, as their implementation does not vary among the different configurations.
  • ...and 2 more figures
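
The equivalence named in the Figure 4 caption (a standard 2D convolution versus its implementation via im2col) can be illustrated with a small sketch. This is not the paper's implementation: the function names, the single-channel input, unit stride, and absence of padding are all assumptions chosen to keep the example short. It shows only that unrolling each input patch into a row and taking dot products with the flattened kernel gives the same output as the sliding-window form.

```python
# Illustrative sketch only. Assumptions: single channel, stride 1, no padding,
# and "convolution" computed as cross-correlation (no kernel flip).

def conv2d_direct(image, kernel):
    """Sliding-window 2D convolution over a list-of-lists image."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


def im2col(image, kh, kw):
    """Unroll every kh x kw patch of the image into one row of a matrix."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ]


def conv2d_im2col(image, kernel):
    """Same convolution, expressed as patch-matrix times flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    out_w = len(image[0]) - kw + 1
    # Fold the flat result back into the 2D output shape.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
kernel = [[1, 0],
          [0, -1]]

assert conv2d_direct(image, kernel) == conv2d_im2col(image, kernel)
```

The im2col form duplicates overlapping input elements (here a 3x4 input becomes a 6x4 patch matrix), so it trades extra memory for a regular matrix-product structure.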