INR-Arch: A Dataflow Architecture and Compiler for Arbitrary-Order Gradient Computations in Implicit Neural Representation Processing

Stefan Abi-Karam, Rishov Sarkar, Dejia Xu, Zhiwen Fan, Zhangyang Wang, Cong Hao

TL;DR

INR-Arch tackles the challenge of efficiently computing arbitrary-order gradients for implicit neural representations (INRs) with an FPGA dataflow framework: FIFO-based array streams carry data between compute kernels, and a compiler maps gradient computation graphs onto the hardware. The approach combines a streaming dataflow architecture, graph extraction and optimization, deadlock analysis, FIFO depth optimization, and template-based code generation to produce synthesizable HLS designs. On the INR editing benchmark, it achieves 1.8-4.8x and 1.5-3.6x speedups over CPU and GPU baselines respectively, together with 3.1-8.9x and 1.7-4.3x lower memory usage and 1.7-11.3x and 5.5-32.8x lower energy-delay product. The authors plan to release the framework as open source to encourage broader adoption and extension to more models and gradient orders.

Abstract

An increasing number of researchers are finding use for nth-order gradient computations for a wide variety of applications, including graphics, meta-learning (MAML), scientific computing, and most recently, implicit neural representations (INRs). Recent work shows that the gradient of an INR can be used to edit the data it represents directly without needing to convert it back to a discrete representation. However, given a function represented as a computation graph, traditional architectures face challenges in efficiently computing its nth-order gradient due to the higher demand for computing power and higher complexity in data movement. This makes it a promising target for FPGA acceleration. In this work, we introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We address this problem in two phases. First, we design a dataflow architecture that uses FIFO streams and an optimized computation kernel library, ensuring high memory efficiency and parallel computation. Second, we propose a compiler that extracts and optimizes computation graphs, automatically configures hardware parameters such as latency and stream depths to optimize throughput, while ensuring deadlock-free operation, and outputs High-Level Synthesis (HLS) code for FPGA implementation. We utilize INR editing as our benchmark, presenting results that demonstrate 1.8-4.8x and 1.5-3.6x speedup compared to CPU and GPU baselines respectively. Furthermore, we obtain 3.1-8.9x and 1.7-4.3x lower memory usage, and 1.7-11.3x and 5.5-32.8x lower energy-delay product. Our framework will be made open-source and available on GitHub.
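To make the cost of nth-order gradients concrete, the following is a minimal sketch of how the computation graph of a gradient grows with its order. It is not INR-Arch's compiler: the `Node` class, the helper names, and the choice of a single sine neuron `sin(3x)` as a stand-in for an INR are all assumptions made for illustration, and a toy symbolic differentiator replaces a real autodiff framework.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One node of a toy expression graph (hypothetical IR, for illustration only)."""
    op: str                 # "x", "const", "add", "mul", "neg", "sin", "cos"
    args: tuple = ()
    value: float = 0.0      # only used when op == "const"


X = Node("x")


def const(v):
    return Node("const", value=float(v))


def diff(n):
    """Return the graph of d(n)/dx using the usual sum, product and chain rules."""
    if n.op == "x":
        return const(1)
    if n.op == "const":
        return const(0)
    if n.op == "add":
        return Node("add", (diff(n.args[0]), diff(n.args[1])))
    if n.op == "mul":
        a, b = n.args
        return Node("add", (Node("mul", (diff(a), b)), Node("mul", (a, diff(b)))))
    if n.op == "neg":
        return Node("neg", (diff(n.args[0]),))
    if n.op == "sin":
        return Node("mul", (Node("cos", n.args), diff(n.args[0])))
    if n.op == "cos":
        return Node("neg", (Node("mul", (Node("sin", n.args), diff(n.args[0]))),))
    raise ValueError(f"unknown op {n.op!r}")


def evaluate(n, x):
    """Evaluate the graph at a scalar input x."""
    if n.op == "x":
        return x
    if n.op == "const":
        return n.value
    vals = [evaluate(a, x) for a in n.args]
    if n.op == "add":
        return vals[0] + vals[1]
    if n.op == "mul":
        return vals[0] * vals[1]
    if n.op == "neg":
        return -vals[0]
    if n.op == "sin":
        return math.sin(vals[0])
    return math.cos(vals[0])  # the only remaining op is "cos"


def nth_grad(n, order):
    """Differentiate the graph `order` times."""
    for _ in range(order):
        n = diff(n)
    return n


def tree_size(n):
    """Number of operations if every subexpression is recomputed."""
    return 1 + sum(tree_size(a) for a in n.args)


def unique_size(n):
    """Number of distinct subexpressions, i.e. the size after sharing duplicates."""
    seen = set()

    def visit(m):
        if m not in seen:
            seen.add(m)
            for a in m.args:
                visit(a)

    visit(n)
    return len(seen)


f = Node("sin", (Node("mul", (const(3), X)),))  # assumed toy "INR": sin(3x)
for order in range(4):
    g = nth_grad(f, order)
    print(order, tree_size(g), unique_size(g), evaluate(g, 0.5))
```

The unsimplified graph grows quickly with the gradient order, while the number of distinct subexpressions grows much more slowly. That gap is the kind of redundancy that graph-level optimization can remove, although the deduplication here is only a rough analogy for what a real compiler would do.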

Paper Structure (23 sections, 8 figures, 4 tables)

This paper contains 23 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: A visual overview of A) Implicit Neural Representations (INRs), and B) INR Editing using the INSP-Net architecture [inrdsp].
  • Figure 2: An overview of the INR-Arch framework for end-to-end hardware acceleration for INR editing based on the INSP-Net architecture [inrdsp].
  • Figure 3: Illustration of the array_stream data structure, the library of stream-based kernels, and an example compute graph mapped to a dataflow architecture.
  • Figure 4: Visualization of the computation graph merging optimization. Similar computations are indicated with identical colors to represent their presence both within and across graphs. The merging of these graphs effectively minimizes redundant computations.
  • Figure 5: An example of a computation graph that deadlocks with default FIFO sizing for any non-trivial input. The root cause is contention between the "Mm" node, which buffers elements and only writes out data after a delay, and the "Cos" node, which writes out data every cycle.
  • ...and 3 more figures
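The deadlock described in the Figure 5 caption can be reproduced with a small step-by-step simulation. This is a sketch under stated assumptions, not the paper's deadlock analysis: the fork/join topology (one source feeding both "Mm" and "Cos", whose outputs meet at a join), the rule that "Mm" gathers all `n` inputs before emitting anything, the default FIFO depth of 2, and the names `simulate` and `min_safe_depth` are all hypothetical.

```python
from collections import deque


def simulate(n, depth_bypass, depth_default=2, max_steps=10_000):
    """Return True if all n results are produced, False if the graph deadlocks.

    Assumed topology: source -> Mm -> join and source -> Cos -> join.
    Only the FIFO after Cos has depth `depth_bypass`; all others use `depth_default`.
    """
    to_mm, to_cos, mm_out, cos_out = deque(), deque(), deque(), deque()
    sent = 0          # elements the source has written so far
    mm_buf = []       # elements Mm has gathered so far
    mm_emitted = 0    # elements Mm has written out so far
    joined = 0        # results produced by the join

    for _ in range(max_steps):
        progress = False

        # Source: writes one element to both branches, blocks if either FIFO is full.
        if sent < n and len(to_mm) < depth_default and len(to_cos) < depth_default:
            to_mm.append(sent)
            to_cos.append(sent)
            sent += 1
            progress = True

        # Mm: assumed to need its whole input before it can emit the first output.
        if len(mm_buf) < n:
            if to_mm:
                mm_buf.append(to_mm.popleft())
                progress = True
        elif mm_emitted < n and len(mm_out) < depth_default:
            mm_out.append(mm_buf[mm_emitted])
            mm_emitted += 1
            progress = True

        # Cos: forwards one element per step unless its output FIFO is full.
        if to_cos and len(cos_out) < depth_bypass:
            cos_out.append(to_cos.popleft())
            progress = True

        # Join: needs one element from each branch.
        if mm_out and cos_out:
            mm_out.popleft()
            cos_out.popleft()
            joined += 1
            progress = True

        if joined == n:
            return True
        if not progress:
            return False  # nobody can move: deadlock

    raise RuntimeError("step limit reached without finishing or deadlocking")


def min_safe_depth(n, depth_default=2):
    """Smallest depth of the Cos output FIFO that avoids deadlock (linear search)."""
    for depth in range(1, n + 1):
        if simulate(n, depth, depth_default):
            return depth
    return None


print(simulate(16, depth_bypass=2))   # default sizing on every FIFO
print(min_safe_depth(16))
```

In this toy model the "Cos" branch fills its FIFOs while "Mm" is still gathering input, the source then blocks, and "Mm" never receives the elements it is waiting for. Deepening only the FIFO after "Cos" is enough to break the cycle, which is the intuition behind searching for per-stream depths rather than enlarging every FIFO.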