DAXFS: A Lock-Free Shared Filesystem for CXL Disaggregated Memory

Cong Wang, Yiwei Yang, Yusheng Zheng

Abstract

CXL (Compute Express Link) enables multiple hosts to share byte-addressable memory with hardware cache coherence, but no existing filesystem exploits this for lock-free multi-host coordination. We present DaxFS, a Linux filesystem for CXL shared memory that uses cmpxchg atomic operations, which CXL makes coherent across host boundaries, as its sole coordination primitive. A CAS-based hash overlay enables lock-free concurrent writes from multiple hosts without any centralized coordinator. A cooperative shared page cache with a novel multi-host clock eviction algorithm (MH-clock) provides demand-paged caching in shared DAX memory, with fully decentralized victim selection via cmpxchg. We validate multi-host correctness using QEMU-emulated CXL 3.0, where two virtual hosts share a memory region with TCP-forwarded atomics. Under cross-host contention, DaxFS maintains >99% CAS accuracy with no lost updates. On single-host DRAM-backed DAX, DaxFS exceeds tmpfs throughput across all write workloads, achieving up to 2.68x higher random write throughput with 4 threads and 1.18x higher random read throughput at 64 KB. Preliminary GPU microbenchmarks show that the cmpxchg-based design extends to GPU threads performing page cache operations at PCIe 5.0 bandwidth limits.
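As a minimal sketch of the coordination style described above, the following user-space C11 fragment shows a lock-free insert into a CAS-based hash overlay: the new entry is published by a single compare-and-swap on a bucket head, which CXL keeps coherent across hosts. The names (overlay_entry, bucket_heads, overlay_insert) and the bucket layout are illustrative assumptions, not DaxFS's actual on-media format or kernel API; the in-kernel implementation would use the architecture's cmpxchg primitives rather than <stdatomic.h>.

    #include <stdatomic.h>
    #include <stdint.h>

    #define NBUCKETS 1024

    /* Illustrative overlay entry; field names are assumptions, not DaxFS's
     * on-media format. 'next' chains entries that hash to the same bucket. */
    struct overlay_entry {
        uint64_t key;        /* e.g. hash of (inode, page index) */
        uint64_t data_off;   /* page offset within the shared DAX region */
        _Atomic(struct overlay_entry *) next;
    };

    /* In DaxFS the bucket heads would live in the CXL-shared region so every
     * host sees the same table; a static array stands in for that here. */
    static _Atomic(struct overlay_entry *) bucket_heads[NBUCKETS];

    /* Lock-free publish: link the new entry in front of the current head and
     * swing the head pointer with one compare-and-swap. If another host (or
     * thread) wins the race, 'old' is refreshed and the loop retries. */
    static void overlay_insert(struct overlay_entry *e)
    {
        _Atomic(struct overlay_entry *) *head = &bucket_heads[e->key % NBUCKETS];
        struct overlay_entry *old = atomic_load(head);

        do {
            atomic_store_explicit(&e->next, old, memory_order_relaxed);
        } while (!atomic_compare_exchange_weak(head, &old, e));
    }

Because the only write that other hosts can observe is the final CAS on the bucket head, readers either see the fully linked entry or the previous chain, which is what makes a centralized coordinator unnecessary.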

Paper Structure

This paper contains 52 sections, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: DaxFS memory layout on a CXL shared memory region. Multiple hosts and GPU accelerators access the same DAX-mapped region concurrently: reads resolve from the overlay down to the base image; writes insert via cmpxchg; the shared page cache serves all hosts and GPUs. GPUs map the region via dma-buf and coordinate using PCIe AtomicOp TLPs.
  • Figure 2: Sequential read throughput comparison across filesystems (Intel Xeon Gold 5418Y, DRAM-backed DAX). Left: 16 MB sequential read; right: 1 MB sequential read. DaxFS achieves 59.2 MB/s on the large sequential workload (6.2x ext4-dax) and matches tmpfs on the 1 MB file (100 MB/s each). The 16 MB gap versus tmpfs (228.6 MB/s) reflects the overhead of per-page overlay resolution on large sequential scans; at 1 MB the overhead is amortized.
  • Figure 3: Sequential read latency comparison (lower is better). DaxFS completes a 16 MB read in 274.2 ms, 6.1x faster than ext4-dax (1,660.7 ms). On 1 MB files, DaxFS and tmpfs finish in 18.0 and 17.7 ms respectively.
  • Figure 4: Metadata operation latency on 1,000 files (lower is better). DaxFS achieves the lowest stat latency (3,528 ms), 4.6% faster than ext4-dax (3,699 ms) and 6.3% faster than tmpfs (3,764 ms), due to flat overlay lookup. On readdir, DaxFS (33.8 ms) is slower than tmpfs (23.1 ms) and ext4-dax (23.9 ms) due to the cost of iterating both the base image dirent array and the overlay linked list.
  • Figure 5: Normalised performance summary across all benchmarks (Intel Xeon Gold 5418Y, DRAM-backed DAX). Each group is normalised to the best-performing filesystem. DaxFS matches tmpfs on 1 MB sequential reads and achieves the lowest stat latency, while outperforming ext4-dax on data I/O. The 16 MB sequential read and readdir gaps versus tmpfs reflect per-page overlay lookup and dual-source directory iteration overhead.
  • ...and 2 more figures