Faster Offloads by Unloading them -- The RDMA Case

Georgia Fragkouli, Laurent Vanbever

TL;DR

This paper starts from the observation that full hardware offloads, while often beneficial, do not always yield speedups. It introduces uRDMA, a bidirectional offload framework that can dynamically unload part of an offloaded task, and focuses on RDMA writes, where the goal is to reduce RNIC cache misses and PCIe translation overhead. The approach comprises a decision module, which routes each request to either the offload path or the unload path, and an unload module, which executes the unloaded portions on the CPU while preserving data and security semantics. A preliminary evaluation with NVIDIA ConnectX-5 Ex RNICs shows that unloading improves RTT latency by up to 31% and that adaptive unloading can match or outperform fixed offload or unload strategies across different memory-region workloads. The work lays a foundation for extending bidirectional unloading to other offloads and discusses compatibility, potential hardware integration, and generalization challenges.

Abstract

From hardware offloads like RDMA to software ones like eBPF, offloads are everywhere and their value is in performance. However, there is evidence that fully offloading -- even when feasible -- does not always give the expected speedups. Starting from the observation that this is due to changes the offloads make -- by moving tasks from the application/CPU closer to the network/link layer -- we argue that to further accelerate offloads, we need to make offloads reversible by unloading them -- moving back part of the offloaded tasks. Unloading comes with a set of challenges that we start answering in this paper by focusing on (offloaded) RDMA writes: Which part of the write operation does it make sense to unload? How do we dynamically decide which writes to execute on the unload or offload path to improve performance? How do we maintain compatibility between the two paths? Our current prototype shows the potential of unloading by accelerating RDMA writes by up to 31%.

Paper Structure

This paper contains 34 sections, 3 figures.

Figures (3)

  • Figure 1: uRDMA makes offload operations reversible and dynamically decides whether to unload or offload.
  • Figure 2: uRDMA architecture. Orange dashed arrows show additional traversals in the absence of uRDMA.
  • Figure 3: Unloading improves RTT latency by up to 31%, and adaptive achieves the overall best RTT latency.