Arcalis: Accelerating Remote Procedure Calls Using a Lightweight Near-Cache Solution

Johnson Umeike, Pongstorn Maidee, Bahar Asgari

TL;DR

Arcalis is a near-cache RPC accelerator that places a lightweight hardware engine adjacent to the last-level cache (LLC). It achieves up to 1.62 times higher throughput than prior solutions, highlighting the potential of near-cache RPC acceleration as a practical solution for high-performance microservice deployment.

Abstract

Modern microservices increasingly depend on high-performance remote procedure calls (RPCs) to coordinate fine-grained, distributed computation. As network bandwidths continue to scale, the CPU overhead associated with RPC processing, particularly serialization, deserialization, and protocol handling, has become a critical bottleneck. This challenge is exacerbated by fast user-space networking stacks such as DPDK, which expose RPC processing as the dominant performance limiter. While prior work has explored software optimizations and FPGA-based offload engines, these approaches remain physically distant from the CPU's memory hierarchy, incurring unnecessary data movement and cache pollution. We present Arcalis, a near-cache RPC accelerator that positions a lightweight hardware engine adjacent to the last-level cache (LLC). Arcalis offloads RPC processing to dedicated microengines on the receive and transmit paths that operate with cache-line latency while preserving programmability. By decoupling RPC processing logic, enabling microservice-specific execution, and positioning itself near the LLC to immediately consume data injected by network cards, Arcalis achieves a 1.79-4.16$\times$ end-to-end speedup over the CPU baseline, reduces microarchitectural overhead by up to 88%, and achieves up to 1.62$\times$ higher throughput than prior solutions. These results highlight the potential of near-cache RPC acceleration as a practical solution for high-performance microservice deployment.

Paper Structure (26 sections, 16 figures, 5 tables)

Figures (16)

  • Figure 1: RPC enables cross-language, platform-independent communication between heterogeneous microservices.
  • Figure 2: RPC processing pipeline in modern microservice deployments. Stages 1-3 and 5-6 are implemented within standard RPC frameworks (Thrift, gRPC).
  • Figure 3: RPC communication workflow. After service definition and code generation (steps 1-2), the client invokes a remote function, which must go through serialization, network transmission, and deserialization (steps 3-6). The server processes the request and returns a response following the reverse path (steps 7-9), with client and server stubs abstracting the underlying communication.
  • Figure 4: Effectiveness of Kernel-Bypass Networking for RPC Execution: (a) Kernel and DPDK receive RPC processing paths. DPDK directly accesses the network interface hardware using its Poll Mode Driver (PMD). (b) Throughput (thousand requests per second) and average latency for Memcached RPC (0.8 SET ratio) using DPDK and kernel networking.
  • Figure 5: Performance Characterization of RPC Microservices: (a) Pipeline slot utilization showing front-end and back-end stalls, (b)-(d) Roofline analysis demonstrating memory bandwidth bottlenecks in kernel execution.
  • ...and 11 more figures
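
The RPC workflow described in the Figure 3 caption (client call, serialization, transmission, deserialization, server execution, and the reverse path for the response) can be illustrated with a minimal software sketch. This is not Arcalis, Thrift, or gRPC code: the wire format (a 4-byte length prefix followed by a JSON body), the `add` method, and the names `encode_frame`, `decode_frame`, `client_stub`, and `server_stub` are all assumptions made for illustration, and the "network" is replaced by a direct function call.

```python
import json
import struct

# Hypothetical service definition: method name -> server-side function.
def add(a, b):
    return a + b

HANDLERS = {"add": add}

def encode_frame(payload: dict) -> bytes:
    """Serialize a message as a 4-byte big-endian length prefix plus a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return struct.pack(">I", len(body)) + body

def decode_frame(frame: bytes) -> dict:
    """Deserialize a frame produced by encode_frame, checking the length prefix."""
    (length,) = struct.unpack(">I", frame[:4])
    body = frame[4:4 + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    return json.loads(body.decode("utf-8"))

def server_stub(request_frame: bytes) -> bytes:
    """Deserialize the request, run the handler, and serialize the response."""
    request = decode_frame(request_frame)
    result = HANDLERS[request["method"]](*request["args"])
    return encode_frame({"result": result})

def client_stub(method, *args, transport=server_stub):
    """Serialize the call, hand it to the transport, and deserialize the reply."""
    request_frame = encode_frame({"method": method, "args": list(args)})
    response_frame = transport(request_frame)  # stands in for network transmission
    return decode_frame(response_frame)["result"]

if __name__ == "__main__":
    print(client_stub("add", 2, 3))
```

Even in this toy version, every call performs two serializations and two deserializations in addition to the handler itself, which is the per-request processing cost the abstract identifies as the CPU bottleneck.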