SAF: Scalable Acceleration Framework for dynamic and flexible scaling of FPGAs
Masudul Hassan Quraishi, Michael Riera, Fengbo Ren, Aman Arora, Aviral Shrivastava
TL;DR
This paper tackles the scaling bottleneck of FPGA deployments by introducing SAF, an Ethernet-based framework that enables hot-plug, stand-alone FPGAs to connect to a remote host without a local CPU. SAF employs a custom shell and standalone accelerator protocols to support automatic discovery, multi-FPGA partial reconfiguration, memory management, and kernel execution over Ethernet. Empirical results with 20 Arria-10 FPGAs show SAF delivers up to $13\times$ faster reconfiguration, $21\%$ to $38\%$ lower setup costs, and nearly linear performance scaling, along with $25\%$ runtime and $27\%$ energy reductions in on-demand scaling scenarios. The approach offers a practical path to scalable, cost-effective FPGA acceleration for cloud and edge workloads, leveraging remote hosting and network-based orchestration.
Abstract
FPGAs are gaining traction in cloud and edge computing environments due to their hardware flexibility, low latency, and low energy consumption. However, the existing FPGA hardware stack and host-FPGA connectivity do not allow flexible scaling or simultaneous reconfiguration of multiple devices, which limits the adoption of FPGAs at scale. In this paper, we present SAF, an Ethernet-based scalable acceleration framework that allows FPGAs to be hot-plugged into a network in a stand-alone fashion, without connecting to a local host CPU, which enables flexible scalability. SAF provides a custom FPGA shell and a set of Ethernet protocols that allow FPGAs to connect with a remote host to accelerate application kernels. SAF can configure multiple FPGAs simultaneously, which significantly reduces reconfiguration time when scaling. We implemented the SAF framework using the Intel FPGA SDK for OpenCL and 20 Bittware 385A cards with Arria-10 FPGAs. We analyze a case study and conduct experiments to compare SAF with state-of-the-art multi-FPGA clusters. Results show that SAF provides 13× faster reconfiguration than sequential PCIe programming and reduces hardware setup costs by 38%, application runtime by 25%, and energy consumption by 27%. We evaluated the performance scalability of SAF using the PTRANS benchmark of the HPCC FPGA benchmark suite and showed an almost linear speedup for both strong and weak scaling scenarios.
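The benefit of simultaneous over sequential reconfiguration can be illustrated with a toy cost model. The sketch below is not the measurement methodology of this paper: the function names, the per-device programming time, and the per-device overhead term are all hypothetical placeholders, chosen only to show why the speedup grows with the number of devices but can stay below the device count.

```python
def sequential_reconfig_time(n_devices: int, per_device_s: float) -> float:
    """Devices are programmed one after another, so the costs add up."""
    return n_devices * per_device_s


def simultaneous_reconfig_time(
    n_devices: int, per_device_s: float, per_device_overhead_s: float = 0.0
) -> float:
    """All devices are programmed at the same time.

    The programming time is paid once; only a (hypothetical) per-device
    overhead, e.g. for distributing the bitstream, grows with device count.
    """
    return per_device_s + n_devices * per_device_overhead_s


def reconfig_speedup(
    n_devices: int, per_device_s: float, per_device_overhead_s: float = 0.0
) -> float:
    """Ratio of sequential to simultaneous reconfiguration time."""
    sequential = sequential_reconfig_time(n_devices, per_device_s)
    simultaneous = simultaneous_reconfig_time(
        n_devices, per_device_s, per_device_overhead_s
    )
    return sequential / simultaneous


if __name__ == "__main__":
    # Placeholder values, not measurements from the paper.
    for n in (1, 5, 10, 20):
        print(n, round(reconfig_speedup(n, per_device_s=2.0,
                                        per_device_overhead_s=0.1), 2))
```

With zero overhead the model gives a speedup equal to the number of devices; any nonzero overhead pulls the speedup below that ideal, which matches the general shape of a sub-linear reconfiguration speedup on a 20-device cluster.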
