Towards Efficient Flash Caches with Emerging NVMe Flexible Data Placement SSDs

Michael Allison, Arun George, Javier Gonzalez, Dan Helmick, Vikash Kumar, Roshan Nair, Vivek Shah

TL;DR

The paper addresses high device-level write amplification (DLWA) in Flash-based caches and its carbon implications. It proposes NVMe FDP-driven data placement within the CacheLib architecture to separate the small, hot data of the small object cache (SOC) from the large, cold data of the large object cache (LOC), isolating the two without altering the cache design. The authors develop a theoretical model of DLWA and CO2e and pair it with an implementation that adds FDP-aware placement handles and I/O management, validated on production traces from Meta and Twitter. Results show a DLWA near 1 and substantial reductions in embodied and operational carbon, along with improved SSD utilization and feasible multi-tenant deployments, illustrating a practical path to carbon-efficient Flash caches at scale.

Abstract

NVMe Flash-based SSDs are widely deployed in data centers to cache working sets of large-scale web services. As data centers face increasing sustainability demands, such as reduced carbon emissions, efficient management of Flash overprovisioning and endurance has become crucial. Our analysis demonstrates that mixing data with different lifetimes on Flash blocks results in high device garbage collection costs, which either reduce device lifetime or necessitate host overprovisioning. Targeted data placement on Flash to minimize data intermixing and thus device write amplification shows promise for addressing this issue. The NVMe Flexible Data Placement (FDP) proposal is a newly ratified technical proposal aimed at addressing data placement needs while reducing the software engineering costs associated with past storage interfaces, such as ZNS and Open-Channel SSDs. In this study, we explore the feasibility, benefits, and limitations of leveraging NVMe FDP primitives for data placement on Flash media in CacheLib, a popular open-source Flash cache widely deployed and used in Meta's software ecosystem as a caching building block. We demonstrate that targeted data placement in CacheLib using NVMe FDP SSDs helps reduce device write amplification, embodied carbon emissions, and power consumption with almost no overhead to other metrics. Using multiple production traces and their configurations from Meta and Twitter, we show that an ideal device write amplification of ~1 can be achieved with FDP, leading to improved SSD utilization and sustainable Flash cache deployments.
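As a point of reference for the metric discussed above, device-level write amplification is the ratio of bytes the SSD physically writes to NAND to the bytes the host asked it to write, so a value of 1 means garbage collection adds no extra writes. The following is a minimal sketch of that ratio; the function name and the counter values are hypothetical and are not taken from the paper or from any device log.

```python
def dlwa(nand_bytes_written: int, host_bytes_written: int) -> float:
    """Device-level write amplification: NAND writes divided by host writes.

    Both arguments are cumulative byte counters sampled over the same
    interval (hypothetical inputs; real devices expose them in vendor logs).
    """
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return nand_bytes_written / host_bytes_written


# Illustrative numbers only: 130 GB reached NAND for 100 GB of host writes.
GB = 10**9
example = dlwa(130 * GB, 100 * GB)
print(f"DLWA = {example:.2f}")
```

A result of 1.30 would mean that 30% of the NAND writes in the interval were caused by garbage collection relocating live data rather than by the host.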

Paper Structure

This paper contains 40 sections, 3 theorems, 21 equations, 13 figures, 3 tables.

Key Result

Theorem 1

The DLWA for FDP-enabled CacheLib using SOC and LOC data segregation is expressed in terms of $\delta$, the average live SOC bucket migration due to garbage collection. $\delta$ is in turn given in terms of $\text{S}_{\text{SOC}}$, the total SOC size in bytes; $\text{S}_{\text{P-SOC}}$, the total physical space available for SOC data, including device overprovisioning, in bytes; and $\mathcal{W}$, the Lambert W function.
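To illustrate why a Lambert W term shows up in models of this kind, the sketch below uses the classic analysis of a log-structured device under uniformly random overwrites with FIFO garbage collection. This is a textbook-style model chosen for illustration and is not claimed to be the theorem's own equations. In that model the fraction of live data found in a reclaimed block, here called `delta`, satisfies `delta = exp(-alpha * (1 - delta))`, whose non-trivial solution is `delta = -W(-alpha * exp(-alpha)) / alpha`. Reading `alpha` as the ratio of physical to logical space (for example $\text{S}_{\text{P-SOC}} / \text{S}_{\text{SOC}}$) is an assumption, as are all function names.

```python
import math


def lambert_w0(x: float, max_iter: int = 200, tol: float = 1e-14) -> float:
    """Principal branch of Lambert W for -1/e < x <= 0, via Newton's method.

    Solves w * exp(w) = x starting from w = 0, which approaches the root
    monotonically from the right on this interval.
    """
    if x < -1.0 / math.e:
        raise ValueError("x must be >= -1/e")
    w = 0.0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def live_fraction(alpha: float) -> float:
    """Live fraction of a reclaimed block in the illustrative FIFO model.

    alpha is physical space divided by logical space and must exceed 1.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must be > 1 (some overprovisioning)")
    return -lambert_w0(-alpha * math.exp(-alpha)) / alpha


def write_amplification(alpha: float) -> float:
    """Each reclaimed block frees (1 - delta) of its space for new writes."""
    return 1.0 / (1.0 - live_fraction(alpha))


for alpha in (1.07, 1.5, 2.0):
    print(alpha, live_fraction(alpha), write_amplification(alpha))
```

In this illustrative model, more physical space per byte of randomly overwritten data lowers the live fraction and therefore the write amplification, which matches the intuition in Figure 3 that reserving device overprovisioning for SOC data cushions SOC DLWA.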

Figures (13)

  • Figure 1: CacheLib Architecture Overview
  • Figure 2: Conventional SSD vs FDP SSD Architecture.
  • Figure 3: SSD cross-section. 1a shows the intermixing of LOC’s sequential and cold data with SOC’s random and hot data in SSD blocks. 1b shows the inefficient use of device OP by both LOC and SOC data. 2a shows that with SOC data being segregated, invalidation of its data can result in free SSD blocks. 2b shows that with FDP, LOC data which is written sequentially will not cause DLWA. 2c shows the efficient use of device OP exclusively by SOC data to cushion SOC DLWA.
  • Figure 4: CacheLib I/O Path. 1a denotes the placement handle allocator that is responsible for allocating placement handles that consume FDP.
  • Figure 5: DLWA over 60 hours with the KV Cache workload using 50% device utilization, 42GB of RAM and 4% SOC size. FDP-based segregation results in a 1.3x reduction in DLWA.
  • ...and 8 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Theorem 3