
DEX: Scalable Range Indexing on Disaggregated Memory [Extended Version]

Baotong Lu, Kaisong Huang, Chieh-Jan Mike Liang, Tianzheng Wang, Eric Lo

TL;DR

DEX addresses the challenge of scaling range indexes on memory-disaggregated systems by combining compute-side logical partitioning, path-aware caching of both inner and leaf nodes, and cost-aware opportunistic offloading to memory-side CPUs. By colocating each subtree under level $M$ on a single memory server and giving each compute server a disjoint key range, DEX substantially reduces cross-server coherence traffic and remote accesses, while optimistic locks and selective invalidation preserve correctness. On a four-server RDMA cluster, DEX achieves throughput gains of up to 2.5–9.6× over Sherman and SMART, depending on the workload, and both caching and offloading yield significant benefits under skewed and uniform workloads. The work demonstrates that a carefully engineered combination of caching, partitioning, and pushdown can yield scalable, memory-efficient range indexes on disaggregated memory, with practical implications for cost and utilization in data centers.
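The two placement ideas in the TL;DR, disjoint key ranges per compute server and subtrees below level $M$ kept on one memory server, can be illustrated with a minimal Python sketch. It is a hypothetical model, not the paper's code: the names `Partitioner`, `owner`, and `place_node`, the contiguous range boundaries, and the choice to count level $M$ as depth from the root are all assumptions made for illustration.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Partitioner:
    """Routes keys to compute servers that own disjoint, contiguous key ranges.

    Hypothetical sketch: server i owns keys in [bounds[i], bounds[i + 1]),
    and the last server owns everything from bounds[-1] upward.
    """
    bounds: list

    def __post_init__(self):
        if self.bounds != sorted(self.bounds):
            raise ValueError("partition bounds must be sorted")

    def owner(self, key):
        """Return the index of the compute server that owns `key`."""
        if key < self.bounds[0]:
            raise KeyError(f"key {key} is below the first partition")
        return bisect.bisect_right(self.bounds, key) - 1


def place_node(path_from_root, level_m, num_memory_servers):
    """Choose a memory server for a B+-tree node (hypothetical placement rule).

    `path_from_root` lists the child slots taken from the root to the node,
    so its length is the node's depth. A node shallower than `level_m` is
    placed by hashing its own path; a node at depth `level_m` and all of its
    descendants share the same anchor path, so the whole subtree lands on
    one memory server.
    """
    anchor = tuple(path_from_root[:level_m])
    return hash(anchor) % num_memory_servers
```

With `Partitioner([0, 100, 200, 300])`, key 150 routes to compute server 1, and `place_node([1, 2, 3, 4], 2, 4)` equals `place_node([1, 2], 2, 4)` because both nodes share the same depth-2 ancestor. Keeping a subtree on one server is what lets an offloaded traversal below level $M$ chase pointers locally instead of hopping between memory servers.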

Abstract

Memory disaggregation can potentially allow memory-optimized range indexes such as B+-trees to scale beyond one machine while attaining high hardware utilization and low cost. Designing scalable indexes on disaggregated memory, however, is challenging due to rudimentary caching, unprincipled offloading and excessive inconsistency among servers. This paper proposes DEX, a new scalable B+-tree for memory disaggregation. DEX includes a set of techniques to reduce remote accesses, including logical partitioning, lightweight caching and cost-aware offloading. Our evaluation shows that DEX can outperform the state-of-the-art by 1.7–56.3×, and the advantage holds across various setups, such as varying cache sizes and skewness.
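Cost-aware offloading can be pictured as a per-miss comparison between finishing a traversal with one-sided RDMA reads from the compute server and pushing it down to a memory-side CPU, which is scarce. The sketch below is a simplified model under stated assumptions: the class `CostModel`, its latency parameters, and the utilization cap are hypothetical, and the paper's actual policy (compared in Figure 5) may weigh different factors.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Hypothetical offloading cost model; all parameters are illustrative."""
    rdma_read_us: float          # latency of one one-sided RDMA read
    rpc_us: float                # round trip of one offloaded request
    mem_cpu_us_per_node: float   # memory-side CPU time per node visited
    max_mem_cpu_util: float      # offload only while memory-side CPU is below this

    def local_cost(self, remaining_levels):
        # Compute-side traversal: one RDMA read per uncached level.
        return remaining_levels * self.rdma_read_us

    def offload_cost(self, remaining_levels):
        # Offloaded traversal: one RPC plus memory-side pointer chasing.
        return self.rpc_us + remaining_levels * self.mem_cpu_us_per_node

    def should_offload(self, remaining_levels, mem_cpu_util):
        """Offload only if memory-side CPU has headroom and offloading is cheaper."""
        if mem_cpu_util >= self.max_mem_cpu_util:
            return False  # memory-side compute is saturated; stay local
        return self.offload_cost(remaining_levels) < self.local_cost(remaining_levels)
```

With illustrative numbers (a 2 µs RDMA read, a 4 µs RPC, 0.5 µs of memory-side work per node, and an 80% utilization cap), a miss with one uncached level stays local (4.5 µs offloaded vs. 2 µs local), while a miss with four uncached levels is offloaded (6 µs vs. 8 µs) unless memory-side utilization has already reached the cap.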

Paper Structure (28 sections, 18 figures, 3 tables, 1 algorithm)

Figures (18)

  • Figure 1: Desiderata of indexes on disaggregated memory. Caching should work with the smaller speed gap between local and remote memory, and limited local memory. Offloading should be aware of the scarcity of memory-side compute. Design should recognize potential data inconsistencies among servers (red arrows).
  • Figure 2: Overview of DEX. Each compute server "owns" a disjoint range of the key space and caches tree traversal paths in local DRAM. Upon cache misses, the compute server selectively offloads index operations when profitable. B+-tree nodes are distributed across memory servers; however, each subtree under level $M$ is placed entirely on one memory server to avoid expensive pointer chasing across memory servers during offloading.
  • Figure 3: DEX caching in a compute server. Potential eviction candidates are first admitted to the cooling map, a hash table of FIFO arrays that alleviates contention. To admit a new node (N9), the first thread that accesses it signals an in-progress RDMA read by atomically setting an I/O flag in the mapping table. Subsequent concurrent threads then re-traverse the path from the root, so that multiple threads do not repeatedly issue RDMA reads for the same node.
  • Figure 4: DEX's scalability with different cooling structures.
  • Figure 5: DEX's throughput under different offloading policies, with the cache size set to $1\%$ of the data.
  • ...and 13 more figures
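The two mechanisms in the Figure 3 caption, a cooling map split into independent FIFO buckets and an I/O flag that lets only one thread fetch a missing node, can be sketched in Python as below. This is a simplified, hypothetical model rather than the paper's implementation: `CoolingMap`, `MappingTable`, `try_begin_io`, and `lookup` are illustrative names, a Python lock stands in for the atomic flag update, and we assume an overflowing FIFO bucket releases its oldest candidate.

```python
import threading
from collections import deque


class CoolingMap:
    """Hash table of FIFO arrays holding eviction candidates (hypothetical sketch).

    Each bucket has its own lock, so threads admitting candidates that hash to
    different buckets do not contend on a single shared queue.
    """

    def __init__(self, num_buckets, bucket_capacity):
        self.buckets = [deque() for _ in range(num_buckets)]
        self.locks = [threading.Lock() for _ in range(num_buckets)]
        self.capacity = bucket_capacity

    def admit(self, node_id):
        """Add a candidate; return the bucket's oldest candidate if the bucket overflows."""
        b = hash(node_id) % len(self.buckets)
        with self.locks[b]:
            self.buckets[b].append(node_id)
            if len(self.buckets[b]) > self.capacity:
                return self.buckets[b].popleft()  # FIFO order: oldest leaves first
        return None


class MappingTable:
    """Maps node ids to cached nodes; the I/O flag marks an in-progress fetch."""

    def __init__(self):
        self._lock = threading.Lock()
        self._io_in_progress = set()
        self.cached = {}

    def try_begin_io(self, node_id):
        """Atomically set the I/O flag; returns True only for the first caller."""
        with self._lock:
            if node_id in self._io_in_progress or node_id in self.cached:
                return False
            self._io_in_progress.add(node_id)
            return True

    def finish_io(self, node_id, node):
        """Install the fetched node and clear its I/O flag."""
        with self._lock:
            self.cached[node_id] = node
            self._io_in_progress.discard(node_id)


def lookup(table, node_id, fetch_remote):
    """Return the node, or None to tell the caller to re-traverse from the root."""
    if node_id in table.cached:
        return table.cached[node_id]
    if table.try_begin_io(node_id):
        node = fetch_remote(node_id)  # stands in for the single RDMA read
        table.finish_io(node_id, node)
        return node
    return None  # another thread holds (or just released) the flag: restart
```

Calling `try_begin_io(9)` twice returns `True` and then `False`, so only the first thread fetches N9; a concurrent `lookup` that loses this race gets `None` and restarts from the root instead of issuing a duplicate remote read.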