PENDRAM: Enabling High-Performance and Energy-Efficient Processing of Deep Neural Networks through a Generalized DRAM Data Mapping Policy

Rachmad Vidya Wicaksana Putra; Muhammad Abdullah Hanif; Muhammad Shafique

TL;DR

Off-chip DRAM accesses dominate latency and energy in CNN accelerators. PENDRAM combines a generalized DRAM data mapping policy with a design-space exploration and an analytical energy-delay-product ($EDP$) model to minimize memory-related costs across DRAM architectures. The results show that Mapping-3, when combined with SALP or TL-DRAM and adaptive scheduling, yields the lowest $EDP$ across networks such as AlexNet, VGG-16, MobileNet, and SqueezeNet. The framework thus supports DRAM-architecture-agnostic optimization of CNN accelerators for energy-efficient embedded AI applications.

Abstract

Convolutional Neural Networks (CNNs), a prominent type of Deep Neural Networks (DNNs), have emerged as a state-of-the-art solution for machine learning tasks. To improve the performance and energy efficiency of CNN inference, specialized hardware accelerators are widely employed. However, CNN accelerators still face performance and energy-efficiency challenges due to the high latency and energy of off-chip memory (DRAM) accesses, which are especially critical for latency- and energy-constrained embedded applications. Moreover, different DRAM architectures have different access latency and energy profiles, which makes it challenging to optimize them for high-performance and energy-efficient CNN accelerators. To address this, we present PENDRAM, a novel design space exploration methodology that enables high-performance and energy-efficient CNN acceleration through a generalized DRAM data mapping policy. Specifically, it explores the impact of different DRAM data mapping policies and DRAM architectures across different CNN partitioning and scheduling schemes on the DRAM access latency and energy, then identifies the Pareto-optimal design choices. The experimental results show that our DRAM data mapping policy improves the energy-delay product (EDP) of DRAM accesses in the CNN accelerator over other mapping policies by up to 96%. In this manner, our PENDRAM methodology offers high-performance and energy-efficient CNN acceleration under any given DRAM architecture for diverse embedded AI applications.
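To make the energy-delay-product metric concrete, the following is a minimal sketch of how the EDP of DRAM accesses could be estimated from per-condition access counts. It is not the paper's analytical model: the names `AccessCost`, `estimate_edp`, and `edp_improvement` are hypothetical, the per-access costs are placeholder numbers rather than measured values, and the sketch assumes accesses are serviced one after another, so it ignores any overlap from bank- or subarray-level parallelism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCost:
    """Per-access cost for one access condition (placeholder values only)."""
    latency_cycles: float
    energy_nj: float


def estimate_edp(counts, costs):
    """Return (total latency, total energy, EDP) for a set of DRAM accesses.

    counts maps an access condition (e.g. "hit") to a number of accesses;
    costs maps the same condition to its AccessCost.
    Assumption: accesses are serialized, so latencies simply add up.
    """
    unknown = set(counts) - set(costs)
    if unknown:
        raise KeyError(f"no cost given for conditions: {sorted(unknown)}")
    latency = sum(n * costs[c].latency_cycles for c, n in counts.items())
    energy = sum(n * costs[c].energy_nj for c, n in counts.items())
    return latency, energy, latency * energy


def edp_improvement(baseline_edp, new_edp):
    """Fractional EDP reduction of new_edp relative to baseline_edp."""
    return 1.0 - new_edp / baseline_edp


# Placeholder costs: a row buffer hit is cheaper than a miss or a conflict.
costs = {
    "hit": AccessCost(latency_cycles=10, energy_nj=2),
    "miss": AccessCost(latency_cycles=20, energy_nj=4),
    "conflict": AccessCost(latency_cycles=30, energy_nj=6),
}

# Two hypothetical mappings issuing the same 1000 accesses.
hit_friendly = {"hit": 900, "miss": 50, "conflict": 50}
conflict_heavy = {"hit": 100, "miss": 100, "conflict": 800}

_, _, edp_good = estimate_edp(hit_friendly, costs)
_, _, edp_bad = estimate_edp(conflict_heavy, costs)
print(f"EDP reduction: {edp_improvement(edp_bad, edp_good):.1%}")
```

Because EDP multiplies latency by energy, a mapping that lowers both at once is rewarded more than proportionally, which is why shifting accesses from conflicts to row buffer hits can produce large relative improvements.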

Paper Structure (18 sections, 3 equations, 15 figures, 2 tables, 1 algorithm)

Introduction
The State-of-the-Art and Their Limitations
Motivational Case Study and Scientific Challenges
Our Novel Contributions
Preliminaries
Data Partitioning and Scheduling for CNN Processing
DRAM Fundamentals
Subarray-Level Parallelism (SALP)-enabled DRAM
Tiered-Latency DRAM (TL-DRAM)
The PENDRAM Methodology
The Generalized DRAM Data Mapping Policy
DSE for Evaluating Different DRAM Mapping Policies
Analytical Model for EDP Estimation of DRAM Accesses
Evaluation Methodology
Results and Discussion
...and 3 more sections

Figures (15)

Figure 1: The typical HW architecture of CNN accelerators. Here, each processing element (PE) represents a Multiply-and-Accumulate (MAC) operation.
Figure 2: DRAM latency-per-access and energy-per-access for different access conditions (i.e., a row buffer hit, a row buffer miss, a row buffer conflict, subarray- and bank-level parallelism) in different DRAM architectures (DDR3, SALP-1, SALP-2, SALP-MASA, and TL-DRAM). Data are obtained from our experiments using state-of-the-art cycle-accurate DRAM simulators (Ramulator [Kim et al., LCA 2015] and VAMPIRE [Ghose et al., POMACS 2018]) for DDR3-1600 2Gb x8 and SALP 2Gb x8 with 8 subarrays-per-bank.
Figure 3: The overview of our novel contributions, highlighted in blue. In this work, we consider separate on-chip buffers in a CNN accelerator for different data types, i.e., input buffer (iB) for ifms, weight buffer (wB) for wghs, and output buffer (oB) for ofms.
Figure 4: Pseudo-code of the tiled CNN processing. Here, inner loops represent the on-chip processing, outer loops represent the off-chip data access, and $s$ denotes the stride of convolution.
Figure 5: (a) Overview of the DRAM organization. (b) Physical implementation of a DRAM bank, showing multiple subarrays in a bank.
...and 10 more figures
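The tiled processing described in the Figure 4 caption can be illustrated with a small sketch. The figure itself is not reproduced here, so the loop order, the tile-size names (`Tm`, `Tc`, `Th`, `Tw`), and the function names below are assumptions made for illustration, not the paper's pseudo-code. The outer loops step over tiles and stand for off-chip (DRAM) data access; the inner loops process one tile and stand for on-chip computation; `s` is the convolution stride.

```python
def conv_direct(ifm, wgh, s):
    """Reference convolution. ifm[c][y][x], wgh[m][c][ky][kx], stride s."""
    C, H, W = len(ifm), len(ifm[0]), len(ifm[0][0])
    M, K = len(wgh), len(wgh[0][0])
    OH, OW = (H - K) // s + 1, (W - K) // s + 1
    return [[[sum(ifm[c][y * s + ky][x * s + kx] * wgh[m][c][ky][kx]
                  for c in range(C) for ky in range(K) for kx in range(K))
              for x in range(OW)] for y in range(OH)] for m in range(M)]


def conv_tiled(ifm, wgh, s, Tm, Tc, Th, Tw):
    """Tiled convolution; returns (ofm, number of tile fetches)."""
    C, H, W = len(ifm), len(ifm[0]), len(ifm[0][0])
    M, K = len(wgh), len(wgh[0][0])
    OH, OW = (H - K) // s + 1, (W - K) // s + 1
    ofm = [[[0] * OW for _ in range(OH)] for _ in range(M)]
    tile_fetches = 0
    # Outer loops: each iteration models fetching one ifm tile and one
    # wgh tile from DRAM into the on-chip buffers.
    for m0 in range(0, M, Tm):
        for y0 in range(0, OH, Th):
            for x0 in range(0, OW, Tw):
                for c0 in range(0, C, Tc):
                    tile_fetches += 1
                    # Inner loops: on-chip processing of the current tile.
                    for m in range(m0, min(m0 + Tm, M)):
                        for y in range(y0, min(y0 + Th, OH)):
                            for x in range(x0, min(x0 + Tw, OW)):
                                for c in range(c0, min(c0 + Tc, C)):
                                    for ky in range(K):
                                        for kx in range(K):
                                            ofm[m][y][x] += (
                                                ifm[c][y * s + ky][x * s + kx]
                                                * wgh[m][c][ky][kx])
    return ofm, tile_fetches


# Small integer example: 2 input channels of 5x5, 3 filters of 3x3, stride 2.
ifm = [[[c + y + x for x in range(5)] for y in range(5)] for c in range(2)]
wgh = [[[[m + c + ky - kx for kx in range(3)] for ky in range(3)]
        for c in range(2)] for m in range(3)]
ofm, fetches = conv_tiled(ifm, wgh, s=2, Tm=2, Tc=1, Th=1, Tw=2)
assert ofm == conv_direct(ifm, wgh, s=2)
```

Smaller tiles fit smaller on-chip buffers but increase the number of outer-loop iterations, and therefore the number of DRAM accesses whose latency and energy a data mapping policy has to keep low.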
