RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)

Alexander Thomasian

TL;DR

The paper surveys reliability and performance design in storage systems, focusing on RAID and erasure-coding-based redundancy across HDDs and SSDs in cloud-scale environments. It systematically categorizes replication versus erasure coding, highlights MDS-optimal codes with $m = n-k$ check disks, and catalogs a broad spectrum of schemes (e.g., PMDS, RDP, LRC, 2D parity grids, HRAID, RESAR, 3D parity) along with practical considerations for rebuilds and load balancing. Performance modeling via queueing theory ($R = \bar{x}/(1-\rho)$, $W = \rho \bar{x}/(1-\rho)$, $\rho = \lambda \bar{x}$) and reliability analyses (CTMCs, simulations, and tools) are discussed to compare designs under hardware trends (HDD vs SSD) and cloud deployments. The practical impact lies in guiding hyperscalers and cloud providers to select and place redundancy schemes that balance reliability, tail latency, and total cost of ownership in heterogeneous storage ecosystems, including emerging technologies like 2D/3D parity and adaptive schemes.
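
The queueing expressions quoted above have the form of the mean response and waiting times of a single-server queue with Poisson arrivals and exponential service times (M/M/1). Assuming that is the intended model, the following minimal Python sketch evaluates them for one disk. The function name `mm1_metrics` and the example numbers (50 requests/s, 10 ms mean service time) are illustrative assumptions, not values from the paper.

```python
def mm1_metrics(arrival_rate: float, mean_service_time: float) -> dict:
    """Evaluate rho = lambda * x, W = rho * x / (1 - rho), R = x / (1 - rho).

    Assumes an M/M/1 queue (an assumption of this sketch, not a statement
    about which model the paper uses for every device).
    arrival_rate is in requests per second, mean_service_time in seconds.
    """
    if arrival_rate < 0 or mean_service_time <= 0:
        raise ValueError("arrival_rate must be >= 0 and mean_service_time > 0")

    rho = arrival_rate * mean_service_time  # server utilization
    if rho >= 1.0:
        # The formulas only hold for a stable queue.
        raise ValueError("utilization must be below 1 for a stable queue")

    waiting = rho * mean_service_time / (1.0 - rho)  # mean time spent queued
    response = mean_service_time / (1.0 - rho)       # mean waiting + service
    return {"rho": rho, "W": waiting, "R": response}


if __name__ == "__main__":
    # Hypothetical disk: 50 requests/s, 10 ms mean service time.
    print(mm1_metrics(arrival_rate=50.0, mean_service_time=0.010))
```

In this model the response time is the waiting time plus one mean service time, so $R = W + \bar{x}$, and both grow without bound as $\rho$ approaches 1, which is why utilization is checked before the formulas are applied.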

Abstract

The original RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in the 1990s, the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data, obviating the need for further redundancy. Variations of RAID based on local recovery codes and partial MDS codes reduce recovery cost. NAND flash Solid State Disks (SSDs) have low latency and high bandwidth, are more reliable, consume less power, and have a lower TCO than Hard Disk Drives (HDDs), which are more viable for hyperscalers.

Paper Structure (9 sections, 5 equations, 2 tables)

Introduction to Storage Systems
Review of Hard Disk Drives
Storage Companies
Hyperscalers and Cloud Storage
Simple Disk Performance Queueing Models
Dealing with High Tail Latency
Data Compression, Compaction and Deduplication
Undetected Disk Errors and Silent Data Corruption - SDC
RAID Classification and Extensions
