VersaSlot: Efficient Fine-grained FPGA Sharing with Big.Little Slots and Live Migration in FPGA Cluster

Jianfeng Gu, Hao Wang, Xiaorang Guo, Martin Schulz, Michael Gerndt

TL;DR

VersaSlot tackles dynamic, fine-grained FPGA sharing with a Big.Little slot architecture that mitigates partial reconfiguration (PR) contention and a dual-core scheduling framework that decouples PR from task execution. The system supports seamless cross-board switching and live migration, backed by an ILP-guided slot allocator and 3-in-1 task bundling to boost utilization. Experiments on Xilinx UltraScale+ hardware show substantial reductions in average and tail latency (up to 13.66x lower average response time than traditional temporal multiplexing, plus improvements over Nimblock) and average LUT and FF utilization gains of 35% and 29%, respectively. The work demonstrates practical, scalable FPGA multiplexing for data-center clusters with reduced blocking and overhead.

Abstract

As FPGAs gain popularity for on-demand application acceleration in data center computing, dynamic partial reconfiguration (DPR) has become an effective fine-grained sharing technique for FPGA multiplexing. However, current FPGA sharing suffers from partial reconfiguration contention and task execution blocking introduced by DPR, which significantly degrade application performance. In this paper, we propose VersaSlot, an efficient spatio-temporal FPGA sharing system with a novel Big.Little slot architecture that effectively resolves the contention and task blocking while improving resource utilization. For the heterogeneous Big.Little architecture, we introduce an efficient slot allocation and scheduling algorithm, along with a seamless cross-board switching and live migration mechanism, to maximize FPGA multiplexing across the cluster. We evaluate the VersaSlot system on an FPGA cluster composed of the latest Xilinx UltraScale+ FPGAs (ZCU216) and compare its performance against four existing scheduling algorithms. The results demonstrate that VersaSlot achieves up to 13.66x lower average response time than traditional temporal FPGA multiplexing, and up to 2.19x better average response time than state-of-the-art spatio-temporal sharing systems. Furthermore, VersaSlot improves LUT and FF resource utilization by 35% and 29% on average, respectively.

Paper Structure

This paper contains 16 sections, 1 equation, 8 figures, and 2 algorithms.

Figures (8)

  • Figure 1: VersaSlot system with PS and PL in FPGA cluster.
  • Figure 2: VersaSlot with Big.Little and Only.Little dual-core scheduling alleviates the PR contention and task execution blocking problems, thus reducing application response time.
  • Figure 3: Parallel and serial bundling for the 3-in-1 task.
  • Figure 4: Cross-board switching loop with the buffer zone.
  • Figure 5: Relative response time reduction under different congestion conditions, normalized to the baseline.
  • ...and 3 more figures