An Analytical Cost Model for Fast Evaluation of Multiple Compute-Engine CNN Accelerators

Fareed Qararyah, Mohammad Ali Maleki, Pedro Trancoso

TL;DR

The paper tackles the challenge of choosing among diverse Compute-Engine (CE) arrangements for CNN accelerators on FPGAs, where traditional synthesis-based evaluation is prohibitively slow. It introduces MCCM, an analytical, bottom-up cost model, together with a modular evaluation methodology that can express any multiple-CE accelerator and rapidly estimate its throughput, latency, on-chip buffer requirements, and off-chip accesses. MCCM is on the order of $100000 \times$ faster than synthesis-based evaluation, with an average accuracy above $90\%$, and is validated across multiple architectures, CNN models, and FPGA boards. The results show that no single CE arrangement is universally best. The practical impact is a fast, accurate, and scalable framework that guides energy- and area-aware accelerator design and, through systematic design-space exploration, yields designs that outperform state-of-the-art ones.

Abstract

Convolutional Neural Networks (CNNs) serve various applications with diverse performance and resource requirements. Model-aware CNN accelerators best address these diverse requirements. These accelerators usually combine multiple dedicated Compute Engines (CEs). The flexibility of Field-Programmable Gate Arrays (FPGAs) enables the design of such multiple Compute-Engine (multiple-CE) accelerators. However, existing multiple-CE accelerators differ in how they arrange their CEs and distribute the FPGA resources and CNN operators among the CEs. The design space of multiple-CE accelerators comprises numerous such arrangements, which makes a systematic identification of the best ones an open challenge. This paper proposes a multiple-CE accelerator analytical Cost Model (MCCM) and an evaluation methodology built around MCCM. The model and methodology streamline the expression of any multiple-CE accelerator and provide a fast evaluation of its performance and efficiency. MCCM is on the order of 100000x faster than traditional synthesis-based evaluation and has an average accuracy of > 90%. The paper presents three use cases of MCCM. The first describes an end-to-end evaluation of state-of-the-art multiple-CE accelerators considering various metrics, CNN models, and resource budgets. The second describes a fine-grained evaluation that helps identify performance bottlenecks of multiple-CE accelerators. The third demonstrates that MCCM's fast evaluation enables exploring the vast design space of multiple-CE accelerators. These use cases show that no single CE arrangement achieves the best results across different metrics, CNN models, and resource budgets. They also show that fast evaluation enables design space exploration, resulting in accelerator designs that outperform state-of-the-art ones. MCCM is available at https://github.com/fqararyah/MCCM.

Paper Structure

This paper contains 29 sections, 10 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Compute-Engine (CE) parallelism examples
  • Figure 2: CNN to multiple-CE architecture mapping examples. For example, in Segmented, CE1 processes layers L1-L4, CE2 processes layers L5 and L6, and so on. In practice, CE and buffer sizes are proportional to the segment layers' compute and memory requirements, respectively.
  • Figure 3: Overview of multiple-CE evaluation methodology.
  • Figure 4: Sequential (layer by layer) and pipelined processing of three convolutional layers
  • Figure 5: Throughput vs. off-chip memory accesses of ResNet50 on ZC706 using 10 accelerator instances per architecture with 2-11 CEs. The numbers indicate the CE counts of the accelerators with the highest throughput or minimum accesses of each architecture.
  • ...and 5 more figures
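
The Segmented mapping described in the Figure 2 caption (CE1 handles L1-L4, CE2 handles L5 and L6, with CE sizes proportional to each segment's compute) can be illustrated with a small sketch. This is not MCCM's model or API: the `ConvLayer` type, the helper functions, the MAC counts, and the budget of 25 processing elements (PEs) are all hypothetical, and the sketch assumes ideal PE utilisation and CEs that work as a coarse pipeline over successive inputs.

```python
from dataclasses import dataclass

@dataclass
class ConvLayer:
    name: str
    macs: int  # multiply-accumulate operations per input (made-up values below)

def segment_macs(segments):
    """Total MACs of each segment (one segment per CE)."""
    return [sum(layer.macs for layer in seg) for seg in segments]

def allocate_pes(segments, total_pes):
    """Split a PE budget across CEs in proportion to segment MACs (hypothetical rule)."""
    macs = segment_macs(segments)
    total = sum(macs)
    # Integer division rounds down; every CE keeps at least one PE.
    return [max(1, total_pes * m // total) for m in macs]

def segment_cycles(segments, pes):
    """Idealised cycles per CE: ceil(MACs / PEs), assuming 100% PE utilisation."""
    return [-(-m // p) for m, p in zip(segment_macs(segments), pes)]

# Six layers with illustrative MAC counts, split as in the caption's example.
layers = [ConvLayer(f"L{i + 1}", m)
          for i, m in enumerate([400, 300, 200, 100, 900, 600])]
segments = [layers[0:4], layers[4:6]]  # CE1: L1-L4, CE2: L5-L6

pes = allocate_pes(segments, total_pes=25)
cycles = segment_cycles(segments, pes)
latency = sum(cycles)    # one input passes through every CE in turn
interval = max(cycles)   # the slowest CE sets the pipelined input rate

# For comparison: a near-even split of the same budget, ignoring segment sizes.
even_cycles = segment_cycles(segments, [12, 13])

print(pes, cycles, latency, interval, even_cycles)
```

Under these assumptions the proportional split gives both CEs the same cycle count, so neither stalls the other, whereas the near-even split leaves CE2 as the bottleneck. Real designs also face buffer limits and off-chip traffic, which this sketch ignores.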