
RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)

Alexander Thomasian

TL;DR

The paper surveys reliability and performance design in storage systems, focusing on RAID and erasure-coding-based redundancy across HDDs and SSDs in cloud-scale environments. It systematically categorizes replication versus erasure coding, highlights MDS-optimal codes with $m = n-k$ check disks, and catalogs a broad spectrum of schemes (e.g., PMDS, RDP, LRC, 2D parity grids, HRAID, RESAR, 3D parity) along with practical considerations for rebuilds and load balancing. Performance modeling via queueing theory ($R = \bar{x}/(1-\rho)$, $W = \rho \bar{x}/(1-\rho)$, $\rho = \lambda \bar{x}$) and reliability analyses (CTMCs, simulations, and tools) are discussed to compare designs under hardware trends (HDD vs SSD) and cloud deployments. The practical impact lies in guiding hyperscalers and cloud providers to select and place redundancy schemes that balance reliability, tail latency, and total cost of ownership in heterogeneous storage ecosystems, including emerging technologies like 2D/3D parity and adaptive schemes.
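
As a minimal sketch of the queueing relations quoted above, the snippet below evaluates $\rho = \lambda \bar{x}$, $W = \rho \bar{x}/(1-\rho)$, and $R = \bar{x}/(1-\rho)$ for a single device. It assumes an M/M/1 queue (Poisson arrivals, exponentially distributed service times), which is the setting where these closed forms hold. The function name `mm1_times` and the sample arrival rate and service time are illustrative choices, not values taken from the paper.

```python
def mm1_times(arrival_rate, mean_service_time):
    """Return (rho, W, R) for an M/M/1 queue.

    arrival_rate      -- lambda, requests per second (assumed Poisson)
    mean_service_time -- x-bar, seconds per request (assumed exponential)
    """
    rho = arrival_rate * mean_service_time           # utilization
    if not 0 <= rho < 1:
        raise ValueError("the queue is stable only for 0 <= rho < 1")
    waiting = rho * mean_service_time / (1 - rho)    # W: mean time spent queued
    response = mean_service_time / (1 - rho)         # R: mean queueing plus service time
    return rho, waiting, response


# Hypothetical device: 50 requests/s with a 10 ms mean service time.
rho, w, r = mm1_times(arrival_rate=50.0, mean_service_time=0.01)
print(f"rho={rho:.2f}  W={w * 1e3:.1f} ms  R={r * 1e3:.1f} ms")
```

Under these assumptions $R = W + \bar{x}$, so at 50% utilization the mean response time is twice the mean service time, and it grows without bound as $\rho$ approaches 1.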

Abstract

The RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in the 1990s, the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data, obviating the need for further redundancy. Variations of RAID based on local recovery codes and partial MDS codes reduce recovery cost. NAND flash Solid State Disks (SSDs) have low latency and high bandwidth, are more reliable, consume less power, and have a lower TCO than Hard Disk Drives (HDDs), which are more viable for hyperscalers.

Paper Structure (9 sections, 5 equations, 2 tables)