A Configurable and Efficient Memory Hierarchy for Neural Network Hardware Accelerator

Oliver Bause, Paul Palomero Bernardo, Oliver Bringmann

TL;DR

This paper tackles the memory bottleneck in neural network hardware accelerators by introducing a configurable, on-demand memory hierarchy with up to five levels plus an optional shift register. The approach leverages per-layer memory access patterns and a pattern-driven prefetching strategy, guided by loop-nest analyses, to minimize on-chip capacity while preserving throughput. Key contributions include a detailed architecture (input buffer, multi-level hierarchy, memory-control unit, pattern calculations, and an optional output shift register), verification via cocotb and a Python model, and an UltraTrail TC-ResNet case study showing up to $62.2\%$ chip-area reduction with only $2.4\%$ performance loss. The results demonstrate practical impact for reducing chip area and enabling flexible, pattern-aware memory management in DNN accelerators, with ongoing work targeting broader pattern support and energy optimizations.

Abstract

As machine learning applications continue to evolve, the demand for efficient hardware accelerators, specifically tailored for deep neural networks (DNNs), becomes increasingly vital. In this paper, we propose a configurable memory hierarchy framework tailored for per-layer adaptive memory access patterns of DNNs. The hierarchy requests data on demand from the off-chip memory to provide it to the accelerator's compute units. The objective is to strike an optimized balance between minimizing the required memory capacity and maintaining high accelerator performance. The framework is characterized by its configurability, allowing the creation of a tailored memory hierarchy with up to five levels. Furthermore, the framework incorporates an optional shift register as the final level to increase the flexibility of the memory management process. A comprehensive loop-nest analysis of DNN layers shows that the framework can efficiently execute the access patterns of most loop unrolls. Synthesis results and a case study of the DNN accelerator UltraTrail indicate a possible reduction in chip area of up to 62.2% as smaller memory modules can be used. At the same time, the performance loss can be minimized to 2.4%.

Paper Structure (23 sections, 12 figures, 2 tables)

This paper contains 23 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Schematic view of the memory access patterns (a) sequential, (b) cyclic, (c) shifted cyclic, (d) strided, (e) pseudo-random, and (f) parallel-shifted cyclic. The number to the left of a memory bank is its start address. Note that all banks of the cyclic patterns start with 0 to highlight that they access the same addresses multiple times. This figure is inspired by jang2010exploiting.
  • Figure 2: Overview of the memory framework's architectural design. This configuration is equipped with the optional output shift register and two hierarchy levels, where level 0 is a single-ported module with a higher capacity than the dual-ported memory of level 1. Note that level 1 needs two address buses (adr.), one for each port.
  • Figure 3: Clock domain crossing between the input buffer and the first hierarchy level. The input buffer is clocked by the faster external clock source while the memory hierarchy is clocked by the slower on-chip clock. To synchronize the data transfer, the two control wires, buffer full and reset buffer, are required.
  • Figure 4: Waveform of read and write cycles of the two hierarchy levels L0 and L1. L0 is a single-ported module that prefers write over read accesses. L0 read_write toggles between read ($=0$) and write ($=1$) cycles. If L0 is in a write cycle, L0 write_address is forwarded to the address port of the memory module, otherwise L0 read_address is inserted. The reading of address 8 is postponed until the write into address 9 is complete. L1 is dual-ported so potential read requests can be ignored. The last read cycle at address 10 cannot be executed yet, since it is still waiting for data to be written into 10 first.
  • Figure 5: Impact of increasing the cycle length from 8 to 1,024 on the clock cycles required to output 5,000 data words, for three configurations with a level-1 depth of 32, 128, and 512, respectively. Each configuration was simulated with and without data preloading enabled.
  • ...and 7 more figures
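
The access patterns named in the Figure 1 caption can be illustrated with small address generators. The following is a minimal sketch, not the paper's implementation: the function names and the parameters `cycle_length`, `shift`, and `stride` are hypothetical, and the pattern definitions follow common usage of these terms rather than definitions given in this excerpt. The pseudo-random and parallel-shifted cyclic patterns are omitted because the excerpt does not describe them in enough detail.

```python
from itertools import islice
from typing import Iterator, List


def sequential(start: int = 0) -> Iterator[int]:
    """Assumed definition: consecutive addresses, each visited once."""
    addr = start
    while True:
        yield addr
        addr += 1


def cyclic(start: int, cycle_length: int) -> Iterator[int]:
    """Assumed definition: the same window of addresses is read repeatedly."""
    while True:
        for offset in range(cycle_length):
            yield start + offset


def shifted_cyclic(start: int, cycle_length: int, shift: int) -> Iterator[int]:
    """Assumed definition: like cyclic, but the window start moves by
    `shift` addresses after every completed cycle."""
    base = start
    while True:
        for offset in range(cycle_length):
            yield base + offset
        base += shift


def strided(start: int, stride: int) -> Iterator[int]:
    """Assumed definition: every `stride`-th address is visited."""
    addr = start
    while True:
        yield addr
        addr += stride


def take(pattern: Iterator[int], n: int) -> List[int]:
    """Return the first n addresses of an (infinite) pattern."""
    return list(islice(pattern, n))


if __name__ == "__main__":
    print("sequential:    ", take(sequential(0), 6))
    print("cyclic:        ", take(cyclic(0, 3), 6))
    print("shifted cyclic:", take(shifted_cyclic(0, 3, 1), 6))
    print("strided:       ", take(strided(0, 4), 6))
```

Under these assumed definitions, a shifted cyclic pattern with `cycle_length=3` and `shift=1` resembles the reads of a sliding window of width 3, which is one reason such patterns are relevant for convolution-style layers.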