I/O Transit Caching for PMem-based Block Device

Qing Xu, Qisheng Jiang, Chundong Wang

TL;DR

The paper targets the performance gap of PMem-based block devices that provide block-level write atomicity through BTT. It introduces Caiti, an I/O transit caching mechanism built on a DRAM cache that is organized into sets and uses eager eviction and conditional bypass. Buffered data moves quickly to PMem, writes avoid stalling on a full cache or an fsync, and eviction runs concurrently on multiple CPU cores. In evaluations with Fio, LevelDB, and VM workloads, Caiti achieves up to 3.6x higher throughput than BTT and several I/O staging baselines, with lower tail latency and block-level write atomicity preserved. The work indicates that carefully designed in-device DRAM caching can unlock the performance potential of PMem-based block devices in real-world storage stacks, benefiting databases, VMs, and file systems.

Abstract

Byte-addressable non-volatile memory (NVM) sitting on the memory bus is used as persistent memory (PMem) for data storage in general-purpose computing systems and embedded systems. Researchers have developed software drivers such as the block translation table (BTT) to build block devices on PMem, so programmers can keep using the mature and reliable conventional storage stack while expecting high performance from fast PMem. However, our quantitative study shows that BTT underutilizes PMem and yields inferior performance, due to the absence of an essential in-device cache. We add a conventional I/O staging cache made of DRAM space to BTT. As DRAM and PMem have comparable access latency, the I/O staging cache is likely to fill up over time. Continual cache evictions and fsyncs then cause on-demand flushes with severe stalls, which makes the I/O staging cache unappealing for PMem-based block devices. We accordingly propose an algorithm named Caiti with novel I/O transit caching. Caiti eagerly evicts buffered data to PMem using multiple CPU cores. It also conditionally bypasses a full cache and writes data directly into PMem to further alleviate I/O stalls. Experiments confirm that Caiti boosts the performance of BTT by up to 3.6x, without loss of block-level write atomicity.
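
The two ideas named in the abstract, eager eviction and conditional bypass, can be illustrated with a toy model. The sketch below is not the paper's implementation: the class name `TransitCache`, its methods, the single lock, and the dictionary standing in for the BTT-backed PMem device are all assumptions made for illustration, and block-level write atomicity and the set-organized cache layout are not modeled.

```python
import threading
from collections import OrderedDict


class TransitCache:
    """Toy model of transit caching; all names here are hypothetical."""

    def __init__(self, capacity, workers=2):
        self.capacity = capacity
        self.cache = OrderedDict()         # DRAM cache: block number -> data
        self.pmem = {}                     # stand-in for the PMem block device
        self.bypassed = 0                  # writes that skipped the full cache
        self.cond = threading.Condition()  # one lock guards cache and pmem
        for _ in range(workers):
            threading.Thread(target=self._evict_loop, daemon=True).start()

    def _evict_loop(self):
        # Eager eviction: drain buffered blocks as soon as they appear,
        # rather than waiting for the cache to fill up.
        while True:
            with self.cond:
                while not self.cache:
                    self.cond.wait()
                lba, data = self.cache.popitem(last=False)
                self.pmem[lba] = data

    def write(self, lba, data):
        with self.cond:
            if lba in self.cache or len(self.cache) < self.capacity:
                self.cache[lba] = data     # buffer in DRAM (hit or free slot)
                self.cond.notify()         # wake one evictor
            else:
                self.pmem[lba] = data      # conditional bypass: full cache, miss
                self.bypassed += 1

    def read(self, lba):
        with self.cond:
            if lba in self.cache:
                return self.cache[lba]
            return self.pmem.get(lba)

    def fsync(self):
        # Flush whatever the evictors have not yet moved to PMem.
        with self.cond:
            while self.cache:
                lba, data = self.cache.popitem(last=False)
                self.pmem[lba] = data


if __name__ == "__main__":
    dev = TransitCache(capacity=4, workers=2)
    for i in range(100):
        dev.write(i % 10, i)
    dev.fsync()
    print(dev.read(3), dev.bypassed)
```

In this model a write bypasses the cache only when the cache is full and the block is not already buffered, so a stale buffered copy can never overwrite newer data in PMem. Because the evictors run continuously, `fsync` usually finds little left to flush, which is the stall the abstract attributes to a conventional staging cache. The single lock keeps the sketch short and correct but serializes all operations; it says nothing about how the real design achieves concurrency.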

Paper Structure

This paper contains 22 sections, 13 figures, 1 table, and 1 algorithm.

Figures (13)

  • Figure 1: An Illustration of Block Translation Table (BTT)
  • Figure 2: A Comparison on BTT, Ext4-DAX, PMem, and PMBD
  • Figure 3: The response time for PMBD and LRU in a window of one million requests
  • Figure 4: Main Components and Write Procedure of Caiti
  • Figure 5: A Comparison with Fio on Average/Runtime Response Time, Tail Latency, and Multi-threads
  • ...and 8 more figures