DecLock: A Case of Decoupled Locking for Disaggregated Memory

Hanze Zhang, Ke Cheng, Rong Chen, Xingda Wei, Haibo Chen

TL;DR

The paper addresses the performance degradation caused by locking in disaggregated memory systems due to MN-NIC contention. It introduces DecLock, a cooperative queue-notify locking protocol (CQL) that decouples lock state maintenance on MNs from ownership transfer across CNs, using an MN-side centralized queue and decentralized CN coordination with an atomic 64-bit header and a non-atomic data plane. It adds a timestamp-based hierarchical locking design to reduce queue sizes while preserving cross-CN fairness. Experiments show substantial gains, including up to 43.37× throughput over RDMA-based spinlocks and up to 1.81× over MCS locks, along with significant tail-latency reductions for DM applications like an object store and the Sherman index, demonstrating practical impact.

Abstract

This paper reveals that locking can significantly degrade the performance of applications on disaggregated memory (DM), sometimes by several orders of magnitude, due to contention on the NICs of memory nodes (MN-NICs). To address this issue, we present DecLock, a locking mechanism for DM that employs decentralized coordination for ownership transfer across compute nodes (CNs) while retaining centralized state maintenance on memory nodes (MNs). DecLock features cooperative queue-notify locking that queues lock waiters on MNs atomically, enabling clients to transfer lock ownership via message-based notifications between CNs. This approach conserves MN-NIC resources for DM applications and ensures fairness. Evaluations show DecLock achieves throughput improvements of up to 43.37× and 1.81× over state-of-the-art RDMA-based spinlocks and MCS locks, respectively. Furthermore, DecLock helps two DM applications, including an object store and a real-world database index (Sherman), avoid performance degradation under high contention, improving throughput by up to 35.60× and 2.31× and reducing 99th-percentile latency by up to 98.8% and 82.1%.
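The queue-notify idea in the abstract can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions, not the paper's protocol or header layout: `MemoryNode`, `simulate_queue_notify`, and `simulate_spinlock_ops` are hypothetical names, each MN-side queue update is counted as one MN-NIC operation, and the spinlock model assumes every waiter retries once per round. It shows why handing ownership over with CN-to-CN messages keeps the MN operation count linear in the number of clients, while retry-based spinning grows faster.

```python
from collections import deque


class MemoryNode:
    """Hypothetical MN model: holds the waiter queue and counts MN-NIC operations."""

    def __init__(self):
        self.waiters = deque()
        self.nic_ops = 0

    def enqueue(self, client_id):
        # Models one atomic operation that appends a waiter to the MN-side queue.
        self.nic_ops += 1
        self.waiters.append(client_id)
        # The caller owns the lock only if it is at the head of the queue.
        return self.waiters[0] == client_id

    def dequeue(self, client_id):
        # Models one atomic operation that removes the holder and exposes its successor.
        self.nic_ops += 1
        assert self.waiters[0] == client_id, "only the holder may release"
        self.waiters.popleft()
        return self.waiters[0] if self.waiters else None


def simulate_queue_notify(num_clients):
    """All clients request the lock, then ownership passes along the queue."""
    mn = MemoryNode()
    grant_order = []
    cn_messages = 0
    holder = None
    for cid in range(num_clients):
        if mn.enqueue(cid):
            holder = cid
    while holder is not None:
        grant_order.append(holder)
        successor = mn.dequeue(holder)
        if successor is not None:
            # Ownership transfer is a CN-to-CN notification, not an MN operation.
            cn_messages += 1
        holder = successor
    return grant_order, mn.nic_ops, cn_messages


def simulate_spinlock_ops(num_clients):
    """Assumed spinlock model: each waiter issues one CAS per round, one wins."""
    nic_ops = 0
    waiting = num_clients
    while waiting > 0:
        nic_ops += waiting  # every waiter issues one CAS to the MN this round
        nic_ops += 1        # the winner's release write also goes to the MN
        waiting -= 1
    return nic_ops


if __name__ == "__main__":
    print(simulate_queue_notify(4))
    print(simulate_spinlock_ops(4))
```

In this model the queue-notify path grants the lock in arrival order and costs two MN operations per client, whereas the assumed spinlock costs a number of MN operations that grows quadratically with the number of contending clients.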

Paper Structure

This paper contains 34 sections, 18 figures.

Figures (18)

  • Figure 1: Throughput and tail latency of update operations in a DM database index using different lock mechanisms as #clients increases. Workload: 10 million objects w/ a Zipf access distribution. Testbed: 8 CNs and 1 MN, connected with 100 Gbps RDMA NICs.
  • Figure 2: The architecture of DM (left) and DM applications (right).
  • Figure 3: Acquisition throughput of spinlock and the average number of CAS operations per acquisition (left), median and 99th-percentile latency of spinlock acquisition and median data access latency (middle), and acquisition throughput of different lock mechanisms (right). The detailed experimental setup is given in the evaluation section.
  • Figure 4: Architecture, data structure, and workflow of DecLock.
  • Figure 5: The CQL lock header encoding.
  • ...and 13 more figures