OffRAC: Offloading Through Remote Accelerator Calls
Ziyi Yang, Krishnan B. Iyer, Yixi Chen, Ran Shu, Zsolt István, Marco Canini, Suhaib A. Fahmy
TL;DR
The paper targets latency-sensitive datacenter workloads and identifies host-managed accelerator offloading as a bottleneck. It proposes OffRAC, a data-path abstraction that decouples data transfer from accelerator invocation, so that clients can call FPGA-based accelerators directly over the network through request reassembly and per-accelerator queues. The authors implement a prototype on an Alveo U280 FPGA with multiple accelerators and report call latencies as low as roughly 10.5 µs, throughput up to 85 Gbps, strong multi-tenant isolation, and substantial energy-efficiency gains over CPU-based execution. They show that request reassembly markedly improves accelerator utilization and throughput, and they discuss modular extensions toward dynamic accelerator reconfiguration and orchestration. Overall, OffRAC offers a scalable path to low-latency acceleration that treats network-attached FPGAs as first-class compute resources in datacenters.
Abstract
Modern applications increasingly demand ultra-low latency for data processing, often facilitated by host-controlled accelerators like GPUs and FPGAs. However, host involvement in accessing accelerators introduces significant delays. To address this limitation, we introduce a novel paradigm we call Offloading through Remote Accelerator Calls (OffRAC), which elevates accelerators to first-class compute resources. OffRAC enables direct calls to FPGA-based accelerators without host involvement. By adopting the stateless function abstraction of serverless computing, in which applications are decomposed into simpler stateless functions, offloading promotes efficient acceleration and distribution of computational loads across the network. To realize this proposal, we present a prototype design and implementation of an OffRAC platform for FPGAs that assembles diverse requests from multiple clients into complete accelerator calls with multi-tenancy performance isolation. This design minimizes the implementation complexity for accelerator users while ensuring isolation and programmability. Results show that the OffRAC approach reduces the latency of network calls to accelerators to approximately 10.5 µs and sustains application throughput of up to 85 Gbps. These results demonstrate scalability and efficiency, making OffRAC compelling for the next generation of low-latency applications.
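The idea of assembling requests from multiple clients into complete accelerator calls can be illustrated with a small software sketch. This is not the paper's hardware design: the `Fragment` fields, the `Reassembler` class, and the queue layout are hypothetical names chosen for illustration, and the sketch assumes each fragment carries its index and the total fragment count of its request.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One network packet carrying part of a request (hypothetical layout)."""
    client_id: int
    request_id: int
    accel_id: int    # which accelerator the request targets
    seq: int         # index of this fragment within the request, 0-based
    total: int       # number of fragments that make up the request
    payload: bytes


class Reassembler:
    """Buffers fragments per request and queues complete calls per accelerator."""

    def __init__(self):
        # (client_id, request_id) -> {seq: payload} for requests still in flight
        self._partial = defaultdict(dict)
        # accel_id -> FIFO of complete calls awaiting that accelerator
        self.queues = defaultdict(deque)

    def on_fragment(self, frag: Fragment) -> bool:
        """Store one fragment; return True if it completed its request."""
        key = (frag.client_id, frag.request_id)
        parts = self._partial[key]
        parts[frag.seq] = frag.payload
        if len(parts) < frag.total:
            return False  # still waiting for more fragments
        # All fragments present: join them in order and release the buffer.
        data = b"".join(parts[i] for i in range(frag.total))
        del self._partial[key]
        self.queues[frag.accel_id].append((frag.client_id, frag.request_id, data))
        return True
```

In this sketch, fragments from different clients may arrive interleaved and out of order. A call reaches an accelerator's queue only once its input is complete, so an accelerator never stalls on a partially received request, and keeping one queue per accelerator stops a backlog on one accelerator from blocking calls to another.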
