Architectural Implications of Neural Network Inference for High Data-Rate, Low-Latency Scientific Applications

Olivia Weng; Alexander Redding; Nhan Tran; Javier Mauricio Duarte; Ryan Kastner

TL;DR

We show that many scientific NN applications must run fully on-chip; in the extreme case, meeting these stringent constraints requires a custom chip.

Abstract

As more scientific fields rely on neural networks (NNs) to process data arriving at extreme throughputs under tight latency budgets, it is crucial to develop NNs with all of their parameters stored on-chip. In many of these applications, there is not enough time to go off-chip and retrieve weights. Moreover, off-chip memory such as DRAM lacks the bandwidth required to process these NNs as fast as the data is produced (e.g., every 25 ns). These extreme latency and bandwidth requirements therefore have architectural implications for the hardware intended to run these NNs: 1) all NN parameters must fit on-chip, and 2) codesigning custom/reconfigurable logic is often required to meet these latency and bandwidth constraints. In our work, we show that many scientific NN applications must run fully on-chip, in the extreme case requiring a custom chip to meet such stringent constraints.
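The bandwidth argument can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the 25 ns interval comes from the abstract, while the parameter count, the weight precision, and the off-chip bandwidth figure are assumptions chosen for the example, not values reported in the paper.

```python
def required_weight_bandwidth_gbps(num_params: int,
                                   bits_per_param: int,
                                   interval_ns: float) -> float:
    """Bandwidth (GB/s) needed to re-fetch every parameter once per inference.

    Bytes per nanosecond is numerically equal to gigabytes per second
    (1 byte/ns = 1e9 bytes/s), so no further unit conversion is needed.
    """
    bytes_per_inference = num_params * bits_per_param / 8
    return bytes_per_inference / interval_ns


def weights_can_stream_off_chip(num_params: int,
                                bits_per_param: int,
                                interval_ns: float,
                                off_chip_gbps: float) -> bool:
    """True if the assumed off-chip bandwidth covers the per-inference weight traffic."""
    needed = required_weight_bandwidth_gbps(num_params, bits_per_param, interval_ns)
    return needed <= off_chip_gbps


# Assumed example: a small 10,000-parameter NN with 8-bit weights,
# one inference every 25 ns (the interval quoted in the abstract).
needed = required_weight_bandwidth_gbps(10_000, 8, 25.0)

# Assumed off-chip figure: 25.6 GB/s, the theoretical peak of a single
# 64-bit DDR4-3200 channel. Real sustained bandwidth is lower.
feasible = weights_can_stream_off_chip(10_000, 8, 25.0, 25.6)

print(f"required: {needed:.1f} GB/s, feasible off-chip: {feasible}")
```

Under these assumptions, even a very small network would need an order of magnitude more bandwidth than the assumed memory channel provides, which is the intuition behind requiring all parameters to fit on-chip. The calculation ignores access latency, which tightens the constraint further.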

Paper Structure

This paper contains 5 sections, 1 figure, 2 tables.

Introduction
Architectural Implications
All NN parameters must fit on chip
Fully on-chip inference often requires hardware-software codesign
Conclusion

Figures (1)

Figure 1: Many scientific and edge NNs must process incoming data at a high rate, requiring on-chip inference to process the data at least as fast as it arrives [duarte2022fastml, wei2023low, borras2022open]. This leads to extreme low-latency and high-bandwidth requirements.
