Varuna: Enabling Failure-Type Aware RDMA Failover

Xiaoyang Wang, Yongkun Li, Lulu Yao, Guoli Wei, Longcheng Yang, Yinlong Xu, Weiqing Kong, Weiguang Wang, Peng Dong, Bingyang Liu

Abstract

RDMA link failures can render connections temporarily unavailable, causing both performance degradation and significant recovery overhead. To tolerate such failures, production datacenters pair each primary link with a standby link and, upon failure, uniformly retransmit all in-flight RDMA requests over the backup path. However, we observe that such blanket retransmission is unnecessary. In-flight requests can be split into pre-failure and post-failure categories depending on whether the responder has already executed them. Retransmitting post-failure requests is not only redundant (it consumes bandwidth) but also incorrect for non-idempotent operations, where duplicate execution can violate application semantics. We present Varuna, a failure-type-aware RDMA recovery mechanism that enables correct retransmission and microsecond-level failover. Varuna piggybacks a lightweight completion log on every RDMA operation; after a link failure, this log deterministically reveals which in-flight requests were executed (post-failure) and which were lost (pre-failure). Varuna then retransmits only the pre-failure subset and fetches/recovers the return values for post-failure requests. Evaluated using synthetic microbenchmarks and end-to-end RDMA TPC-C transactions, Varuna incurs only 0.6-10% steady-state latency overhead in realistic applications, eliminates 65% of recovery retransmission time, preserves transactional consistency, and introduces zero connectivity rebuild overhead and negligible memory overhead during RDMA failover.

Paper Structure

This paper contains 21 sections, 14 figures, 4 algorithms.

Figures (14)

  • Figure 1: Transmission Stages and Failure Types: If a network failure occurs before execution, the entire request is considered un-executed (pre-failure). Otherwise, the request execution is completed at the responder, but the ACK may be lost (post-failure), and no retransmission is required.
  • Figure 2: Retransmission Flow: When retransmitting using a backup link with a backup QP, the standard RDMA retransmission detection mechanism is bypassed. As a result, previously transmitted requests may be applied again at the receiver, leading to redundant execution.
  • Figure 3: Post-Failure in RDMA: (a) Post-failure occurrences are unpredictable across different workloads. (b) Identifying post-failure requests reduces unnecessary resend overhead.
  • Figure 4: Varuna Overview: Completion logs and extended status provide durable evidence of RDMA failures, and vQPs backed by lightweight DCQPs enable immediate failover.
  • Figure 5: Completion Log: Varuna records a log entry at the responder for each request, which distinguishes pre-failure operations from post-failure ones.
  • ...and 9 more figures
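
The pre-failure/post-failure split described in the abstract and in the Figure 1 and Figure 5 captions can be illustrated with a minimal sketch. This is not Varuna's implementation: the `Request` type, the per-connection sequence numbers, and the dictionary-shaped completion log (sequence number mapped to return value) are all assumptions made for illustration. The sketch only shows the decision rule: a request found in the responder's log was executed (post-failure) and must not be resent, while a request absent from the log was lost (pre-failure) and is safe to retransmit.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Request:
    seq: int  # hypothetical per-connection sequence number
    op: str   # operation name, e.g. "WRITE" or "FETCH_ADD"


def classify_in_flight(
    in_flight: List[Request],
    completion_log: Dict[int, int],
) -> Tuple[List[Request], Dict[int, int]]:
    """Split in-flight requests after a link failure.

    completion_log is an assumed representation of the responder-side log:
    it maps the sequence number of every executed request to its return value.
    Returns (requests to retransmit, recovered return values keyed by seq).
    """
    retransmit: List[Request] = []
    recovered: Dict[int, int] = {}
    for req in in_flight:
        if req.seq in completion_log:
            # Post-failure: executed at the responder, only the ACK was lost.
            # Recover the return value instead of executing again.
            recovered[req.seq] = completion_log[req.seq]
        else:
            # Pre-failure: never executed, so retransmission is safe.
            retransmit.append(req)
    return retransmit, recovered


# Usage: request 1 was executed before the failure, requests 2 and 3 were lost.
pending = [Request(1, "FETCH_ADD"), Request(2, "WRITE"), Request(3, "FETCH_ADD")]
to_resend, values = classify_in_flight(pending, {1: 40})
```

Under these assumptions, a non-idempotent operation such as the first `FETCH_ADD` is never applied twice, which is the correctness property the abstract attributes to failure-type-aware recovery.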