Hardware Software Optimizations for Fast Model Recovery on Reconfigurable Architectures

Bin Xu, Ayan Banerjee, Sandeep Gupta

TL;DR

This work tackles the latency and energy challenges of Model Recovery (MR) for physical AI at the edge, where GPU-based MR struggles because of iterative ODE solvers. It introduces MERINDA, an FPGA-friendly MR framework that rewrites MR as a streaming GRU-based neural flow, paired with hardware-aware memory and compute co-design (BRAM tiling, fixed-point arithmetic, and LUT/DSP synergy) to sustain streaming throughput with an initiation interval (II) close to 1. The paper presents concrete hardware architectures, a formal equivalence to NODE-based MR, and extensive FPGA evaluations on four benchmark datasets, including Automated Insulin Delivery. These evaluations show competitive accuracy with orders-of-magnitude improvements in cycle count, throughput, and energy versus prior LTC/GPU baselines. The results point to practical, real-time MR deployments at the edge, enabling robust digital-twin and safety-critical workflows with reduced data movement and power consumption.

Abstract

Model Recovery (MR) is a core primitive for physical AI and real-time digital twins, but GPUs often execute MR inefficiently due to iterative dependencies, kernel-launch overheads, underutilized memory bandwidth, and high data-movement latency. We present MERINDA, an FPGA-accelerated MR framework that restructures computation as a streaming dataflow pipeline. MERINDA exploits on-chip locality through BRAM tiling, fixed-point kernels, and the concurrent use of LUT fabric and carry-chain adders to expose fine-grained spatial parallelism while minimizing off-chip traffic. This hardware-aware formulation removes synchronization bottlenecks and sustains high throughput across the iterative updates in MR. On representative MR workloads, MERINDA delivers up to 6.3x fewer cycles than an FPGA-based LTC baseline, enabling real-time performance for time-critical physical systems.
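
The two ideas the abstract names, fixed-point kernels and tiling for on-chip locality, can be illustrated with a small software sketch. The following is not the paper's implementation: the Q-format (`FRAC_BITS`), the tile size, and all helper names are assumptions chosen for illustration. It computes a matrix-vector product, the core operation inside a GRU gate, in fixed-point arithmetic, processing the input in column tiles the way a BRAM-resident tile would be consumed before the next one is loaded.

```python
# Illustrative sketch only. FRAC_BITS, the tile size, and every helper name
# below are assumptions for this example, not details taken from the paper.
FRAC_BITS = 8            # assumed number of fractional bits
SCALE = 1 << FRAC_BITS   # 256


def to_fixed(x: float) -> int:
    """Quantize a real value to a signed fixed-point integer."""
    return int(round(x * SCALE))


def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a real value."""
    return q / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values and rescale the product."""
    return (a * b) >> FRAC_BITS


def tiled_matvec(W, x, tile=2):
    """Compute y = W x over column tiles, accumulating partial sums per row.

    Each outer iteration touches only `tile` entries of x and the matching
    slice of every row of W, mimicking a tile held in on-chip memory.
    """
    rows, cols = len(W), len(x)
    y = [0] * rows
    for c0 in range(0, cols, tile):
        x_tile = x[c0:c0 + tile]
        for r in range(rows):
            w_tile = W[r][c0:c0 + tile]
            for w, xv in zip(w_tile, x_tile):
                y[r] += fixed_mul(w, xv)
    return y


# Small example with values that are exactly representable in this format.
W_real = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
x_real = [1.0, 2.0, -4.0]
W_q = [[to_fixed(w) for w in row] for row in W_real]
x_q = [to_fixed(v) for v in x_real]

y_tiled = tiled_matvec(W_q, x_q, tile=2)
y_untiled = tiled_matvec(W_q, x_q, tile=len(x_q))
result = [to_float(v) for v in y_tiled]
```

Because integer accumulation is associative, the tiled and untiled results are bit-identical; tiling changes only the order in which data is touched, which is what makes it a locality optimization rather than an approximation.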

Paper Structure

This paper contains 30 sections, 20 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Equivalent GRU-based neural network (NN) architecture for MR.
  • Figure 2: Optimization Framework of MERINDA.
  • Figure 3: Generic architecture for physics-guided inferencing to be used in physical AI.
  • Figure 4: MERINDA: GRU NN-based MR architecture.
  • Figure 5: Overall GRU accelerator architecture. The PE Array is backed by on-chip BRAM and connected to a Memory Reader/Writer that streams data to/from off-chip memory (DDR).
  • ...and 3 more figures