FPGA Deployment of LFADS for Real-time Neuroscience Experiments

Xiaohan Liu; ChiJui Chen; YanLun Huang; LingChi Yang; Elham E Khoda; Yihui Chen; Scott Hauck; Shih-Chieh Hsu; Bo-Cheng Lai

TL;DR

This work addresses the challenge of real-time LFADS inference by deploying an FPGA-accelerated LFADS implementation built with the hls4ml framework. The authors compare post-training quantization (PTQ) of a Keras model with quantization-aware training (QAT) of a QKeras model. They achieve sub-50-microsecond latency (41.97 $\mu$s) on a Xilinx Alveo U55C with a 16-bit fixed-point representation, and show that 10-bit QAT maintains near-floating-point performance. Key contributions include a practical HLS/Keras pathway for deploying Bidirectional GRU layers, a QKeras-based quantized variant, and an IO-optimized HLS implementation that enables real-time LFADS processing on FPGA hardware. The results indicate that quantization can substantially reduce resource usage while preserving accuracy, enabling large-scale real-time neuroscience experiments and paving the way for closed-loop brain-machine interfaces. The work also outlines an automated workflow and discusses scalability considerations for future VAE-based LFADS deployments on FPGAs.
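To make the fixed-point trade-off concrete, the following is a minimal sketch of rounding values onto a signed fixed-point grid, in the spirit of the `ap_fixed<W, I>` convention, where `I` counts the integer bits including the sign. The helper name `quantize_fixed`, the sample values, the 6-integer-bit split, and the choice of round-to-nearest with saturation are all assumptions made for illustration; they are not taken from the paper, and HLS tools may default to different rounding and overflow modes.

```python
import math


def quantize_fixed(x, total_bits, int_bits):
    """Round x to a signed fixed-point grid and saturate at the range limits.

    Illustrative helper (not from the paper). int_bits includes the sign bit,
    so the representable range is [-2**(int_bits-1), 2**(int_bits-1) - step].
    """
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                   # spacing between grid points
    lo = -(2.0 ** (int_bits - 1))              # most negative representable value
    hi = 2.0 ** (int_bits - 1) - step          # most positive representable value
    q = math.floor(x / step + 0.5) * step      # round to nearest grid point
    return min(max(q, lo), hi)                 # saturate instead of wrapping


if __name__ == "__main__":
    values = [0.1234567, -0.9876543, 3.1415927]  # hypothetical weights
    for total_bits, int_bits in [(16, 6), (10, 6)]:
        quantized = [quantize_fixed(v, total_bits, int_bits) for v in values]
        worst = max(abs(v - q) for v, q in zip(values, quantized))
        print(total_bits, int_bits, quantized, worst)
```

For in-range values the rounding error is bounded by half a step, so each fractional bit removed doubles the worst-case error. That is the accuracy cost PTQ pays directly, and the effect QAT lets the network adapt to during training.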

Abstract

Large-scale recordings of neural activity are providing new opportunities to study neural population dynamics. A powerful way to analyze such high-dimensional measurements is to learn their low-dimensional latent dynamics. LFADS (Latent Factor Analysis via Dynamical Systems) is a deep learning method for inferring latent dynamics from high-dimensional neural spiking data recorded simultaneously in single trials. The method has shown remarkable performance in modeling complex brain signals, with an average inference latency on the order of milliseconds. As our capacity to record many neurons simultaneously grows exponentially, it is becoming crucial to support low-latency inference for these algorithms. To improve the real-time processing ability of LFADS, we introduce an efficient implementation of LFADS models on Field Programmable Gate Arrays (FPGAs). Our implementation achieves an inference latency of 41.97 $\mu$s for processing the data in a single trial on a Xilinx U55C.

Paper Structure (13 sections, 8 figures, 1 table)

Introduction
Core Concepts
Model Description
Dataset and Evaluation Metrics
Implementation
HLS implementation: Keras Model
QKeras Model
HLS implementation: QKeras Model
Results
Quantization Results
Resource Utilization
FPGA Latency
Summary and Outlook

Figures (8)

Figure 1: LFADS architecture used for this study.
Figure 2: NPLL values of the AE-based LFADS (red) and a VAE-based LFADS (blue) with a similar architecture.
Figure 3: The structure of a 64-unit quantized GRU cell.
Figure 4: Plots of 4-bit quantized activations, quantized hard activations, and real activations. (a) Quantized sigmoid and quantized hard sigmoid, ranging from 0 to 0.9375. (b) Quantized tanh and quantized hard tanh, ranging from -1 to 0.875.
Figure 5: (a) NPLL and (b) R$^2$ score as a function of fractional bits. The blue line in each plot represents the floating-point model, whereas the other lines correspond to 4 (orange), 6 (green), or 8 (red) integer bits.
...and 3 more figures
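The ranges quoted in the Figure 4 caption follow from 4-bit fixed-point arithmetic: an unsigned 4-bit value with 4 fractional bits tops out at 15/16 = 0.9375, and a signed 4-bit value with 3 fractional bits spans -1 to 7/8 = 0.875. The sketch below reproduces those endpoints. The function names, the piecewise-linear "hard" forms (in particular the 0.5 slope of the hard sigmoid), and the rounding mode are illustrative assumptions rather than the paper's exact definitions.

```python
import math


def hard_sigmoid(x):
    # Assumed piecewise-linear approximation; the 0.5 slope is illustrative.
    return min(max(0.5 * x + 0.5, 0.0), 1.0)


def hard_tanh(x):
    # Assumed piecewise-linear approximation of tanh: clip to [-1, 1].
    return min(max(x, -1.0), 1.0)


def to_grid(y, frac_bits, lo, hi):
    """Round y to a grid with 2**-frac_bits spacing, then clip to [lo, hi]."""
    step = 2.0 ** -frac_bits
    q = math.floor(y / step + 0.5) * step
    return min(max(q, lo), hi)


def quantized_hard_sigmoid(x, bits=4):
    # Unsigned output: all bits are fractional, so the maximum is 1 - 2**-bits.
    return to_grid(hard_sigmoid(x), bits, 0.0, 1.0 - 2.0 ** -bits)


def quantized_hard_tanh(x, bits=4):
    # Signed output: one sign bit, bits - 1 fractional bits.
    frac_bits = bits - 1
    return to_grid(hard_tanh(x), frac_bits, -1.0, 1.0 - 2.0 ** -frac_bits)


if __name__ == "__main__":
    for x in (-10.0, 0.0, 10.0):
        print(x, quantized_hard_sigmoid(x), quantized_hard_tanh(x))
```

Saturating the inputs at large positive and negative values yields 0 and 0.9375 for the sigmoid variant and -1 and 0.875 for the tanh variant, matching the caption.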
