OffRAC: Offloading Through Remote Accelerator Calls

Ziyi Yang, Krishnan B. Iyer, Yixi Chen, Ran Shu, Zsolt István, Marco Canini, Suhaib A. Fahmy

TL;DR

The paper targets latency-sensitive datacenter workloads and identifies host-managed accelerator offloading as a bottleneck. It proposes OffRAC, a data-path abstraction that decouples data transfer from accelerator invocation, enabling direct, networked calls to FPGA-based accelerators via request reassembly and per-accelerator queues. The authors implement a prototype on an Alveo U280 FPGA with multiple accelerators, demonstrating network-call latency of approximately 10.5 µs, throughput up to 85 Gbps, and strong multi-tenant isolation, with substantial energy-efficiency gains over CPU-based execution. They show that request reassembly markedly improves accelerator utilization and throughput, and discuss modular extensions toward dynamic accelerator reconfiguration and orchestration. Overall, OffRAC provides a scalable path to in-network, low-latency acceleration using network-attached FPGAs as first-class compute resources in datacenters.

Abstract

Modern applications increasingly demand ultra-low latency for data processing, often facilitated by host-controlled accelerators like GPUs and FPGAs. However, significant delays result from host involvement in accessing accelerators. To address this limitation, we introduce a novel paradigm we call Offloading through Remote Accelerator Calls (OffRAC), which elevates accelerators to first-class compute resources. OffRAC enables direct calls to FPGA-based accelerators without host involvement. Utilizing the stateless function abstraction of serverless computing, with applications decomposed into simpler stateless functions, offloading promotes efficient acceleration and distribution of computational loads across the network. To realize this proposal, we present a prototype design and implementation of an OffRAC platform for FPGAs that assembles diverse requests from multiple clients into complete accelerator calls with multi-tenancy performance isolation. This design minimizes the implementation complexity for accelerator users while ensuring isolation and programmability. Results show that the OffRAC approach reduces the latency of network calls to accelerators down to approximately 10.5 µs, as well as sustaining high application throughput up to 85 Gbps, demonstrating scalability and efficiency, making it compelling for the next generation of low-latency applications.

Paper Structure

This paper contains 25 sections, 21 figures, 6 tables.

Figures (21)

  • Figure 1: A comparison of network offload approaches. Compared to a full software stack (a), SmartNICs (b) allow some packet-level network functionality to be moved out of software, to enhance ingestion of data into applications running on a host. Heterogeneous systems that consist of both software and hardware (c) add accelerators, which can offload parts of complex applications. Some such frameworks offer direct connectivity between accelerators and the network, but this is a secondary interface to the host-based management of the FPGA. OffRAC (d) is fully contained in an FPGA and allows accelerators to service complete requests at a coarser granularity than packets, while allowing accelerators to be swapped at runtime.
  • Figure 2: Latency comparison of traditional host-controlled and direct function invocation and composition using ClickNP, showing the 25th, 50th, and 75th percentiles.
  • Figure 3: Overview of OffRAC operating model. Multiple clients (outline color) can issue requests addressing different accelerators (fill color), which are fragmented over multiple packets, the first of which determines the size. These fragments are reassembled into complete requests and dispatched to the corresponding accelerator. Reassembly is performed in buffers that are agnostic to client and accelerator for resource efficiency.
  • Figure 4:
  • Figure 5:
  • ...and 16 more figures
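
The operating model in the Figure 3 caption (fragments reassembled in shared buffers, then dispatched to the addressed accelerator) can be illustrated with a small software sketch. This is not the authors' FPGA design: the `Fragment` fields, the `Reassembler` class, the drop-on-full policy, and the assumption that fragments of one request arrive in order are all hypothetical choices made for illustration. What it is meant to show is that the first fragment fixes the request size and target accelerator, that buffers are drawn from one pool regardless of client or accelerator, and that only complete requests reach a per-accelerator queue.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fragment:
    client_id: int
    request_id: int
    payload: bytes
    # Assumed header layout: only the first fragment carries these fields.
    accel_id: Optional[int] = None
    total_size: Optional[int] = None


@dataclass
class _Pending:
    accel_id: int
    total_size: int
    data: bytearray = field(default_factory=bytearray)


class Reassembler:
    """Hypothetical software model of request reassembly."""

    def __init__(self, num_buffers: int):
        # One shared pool: buffers are not tied to a client or accelerator.
        self.free_buffers = num_buffers
        self.pending = {}                  # (client_id, request_id) -> _Pending
        self.queues = defaultdict(deque)   # accel_id -> complete requests

    def on_fragment(self, frag: Fragment) -> bool:
        """Consume one fragment; return False if it had to be dropped."""
        key = (frag.client_id, frag.request_id)
        entry = self.pending.get(key)
        if entry is None:
            # A new request must start with a fragment that declares its size.
            if frag.total_size is None or frag.accel_id is None:
                return False
            # Assumed policy: drop when no buffer is free.
            if self.free_buffers == 0:
                return False
            self.free_buffers -= 1
            entry = _Pending(frag.accel_id, frag.total_size)
            self.pending[key] = entry
        entry.data += frag.payload
        if len(entry.data) >= entry.total_size:
            # Request complete: dispatch to its accelerator queue, free the buffer.
            complete = bytes(entry.data[:entry.total_size])
            self.queues[entry.accel_id].append(complete)
            del self.pending[key]
            self.free_buffers += 1
        return True
```

In this sketch, interleaved fragments from different clients are tracked independently, so a short request addressed to one accelerator can complete and be dispatched while a longer request for another accelerator is still being assembled.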