Making Wide Stripes Practical: Cascaded Parity LRCs for Efficient Repair and High Reliability

Fan Yu, Guodong Li, Si Wu, Weijun Fang, Sihuang Hu

TL;DR

CP-LRCs address critical limitations of wide-stripe LRCs by introducing a cascaded parity structure that couples local parity blocks with a global parity block, enabling efficient parity repair and improved multi-node repair performance while preserving MDS-level fault tolerance. The authors present a general coefficient-generation framework and instantiate it as CP-Azure and CP-Uniform, with a distributed prototype demonstrated on Alibaba Cloud showing up to 41% reduction in single-node repair time and 26% reduction for two-node failures, plus substantial degraded-read gains for small files. They provide theoretical analyses of repair bandwidth, local-repair participation, and MTTDL, and validate the approach through cloud experiments and real-world traces, releasing the implementation for public use. The work significantly enhances the practicality and reliability of wide-stripe erasure coding in large-scale storage systems by enabling cohesive parity cooperation during repair.

Abstract

Erasure coding with wide stripes is increasingly adopted to reduce storage overhead in large-scale storage systems. However, existing Locally Repairable Codes (LRCs) exhibit structural limitations in this setting: inflated local groups increase single-node repair cost, multi-node failures frequently trigger expensive global repair, and reliability degrades sharply. We identify a key root cause: local and global parity blocks are designed independently, preventing them from cooperating during repair. We present Cascaded Parity LRCs (CP-LRCs), a new family of wide-stripe LRCs that embed a structured dependency between parity blocks by decomposing a global parity block across all local parity blocks. This creates a cascaded parity group that preserves MDS-level fault tolerance while enabling low-bandwidth single-node and multi-node repairs. We provide a general coefficient-generation framework, develop repair algorithms that exploit cascading, and instantiate the design with CP-Azure and CP-Uniform. Evaluations on Alibaba Cloud show reductions in repair time of up to 41% for single-node failures and 26% for two-node failures.

Paper Structure

This paper contains 57 sections, 2 theorems, 24 equations, 10 figures, 6 tables.

Key Result

Theorem 1

For a $(k, r)$ Cauchy RS code defined by the $k+r$ distinct elements $a_1, a_2, \dots, a_k$ and $b_1, b_2, \dots, b_r$, there exist $k+r$ nonzero coefficients $\bar{\gamma}_1, \dots, \bar{\gamma}_k$, $\bar{\eta}_1, \dots, \bar{\eta}_{r}$ such that

Figures (10)

  • Figure 1: Illustration of LRC constructions with parameters $(k=6, r=2, p=2)$. $D_1, D_2, \ldots, D_6$ denote data blocks; $L_1$, $L_2$ are local parity blocks; and $G_1$, $G_2$ are global parity blocks.
  • Figure 2: Markov chain for a $(6,2,2)$ LRC.
  • Figure 3: Illustration of CP-LRCs applied to Azure LRC and Uniform LRC with parameters $(6,2,2)$. $\beta_1, \beta_2,\ldots ,\beta_{6}$ and $\gamma_1 ,\gamma_2,\ldots ,\gamma_{6},\eta_1$ are the encoding coefficients for local parities. Note that $L_1$, $L_2$, and $G_2$ form a cascaded group, i.e., both CP-Azure and CP-Uniform satisfy that $G_2 = L_1 + L_2$.
  • Figure 4: Distributed Prototype System
  • Figure 5: Illustration of file-level repair optimization strategies under a stripe-based layout. $D_i$ ($i=1,2,\ldots$) denotes a data block, and $P_i$ represents a parity block. $F_1, F_2, \dots, F_6$ are files (or file fragments) distributed across blocks. The red hatched regions indicate the failed data segments that trigger degraded reads. The blue shaded regions mark the portions of surviving blocks accessed during repair.
  • ...and 5 more figures
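
As a rough illustration of the cascaded-group relation $G_2 = L_1 + L_2$ noted in the Figure 3 caption, the sketch below shows how a lost local parity could be rebuilt from the two surviving members of the cascaded group. It assumes blocks are added over a field of characteristic 2, so that `+` is bytewise XOR, and it uses all-ones local coefficients as a stand-in for the paper's $\beta_i$, $\gamma_i$, and $\eta_1$ values, which are not reproduced here. The helper `xor_blocks` and the block size are hypothetical names chosen for illustration, not the authors' implementation.

```python
import os
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Add equal-sized blocks over GF(2^w), where addition is bytewise XOR."""
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same size")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


BLOCK_SIZE = 16  # hypothetical block size, kept tiny for the example
data = [os.urandom(BLOCK_SIZE) for _ in range(6)]  # D_1 .. D_6

# Stand-in local parities: all coefficients are 1 (an assumption; the real
# codes use coefficients chosen by the coefficient-generation framework).
l1 = xor_blocks(*data[:3])  # local group 1: D_1, D_2, D_3
l2 = xor_blocks(*data[3:])  # local group 2: D_4, D_5, D_6

# Cascaded group: G_2 = L_1 + L_2.
g2 = xor_blocks(l1, l2)

# Any one member of the cascaded group is the sum of the other two.
repaired_l1 = xor_blocks(g2, l2)
repaired_g2 = xor_blocks(l1, l2)
```

In this toy layout, rebuilding $L_1$ reads two blocks ($L_2$ and $G_2$) rather than the three data blocks of its local group. The sketch says nothing about fault tolerance, which depends on the actual coefficients.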

Theorems & Definitions (4)

  • Definition 1: Cauchy RS code
  • Theorem 1
  • proof
  • Corollary 1
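
Definition 1 and Theorem 1 refer to a $(k, r)$ Cauchy RS code defined by $k+r$ distinct elements $a_1, \dots, a_k$ and $b_1, \dots, b_r$. The definition itself is not reproduced on this page, so the sketch below assumes the textbook Cauchy matrix form, in which the parity coefficient for data block $i$ and parity block $j$ is $1/(a_i + b_j)$. It also assumes the field GF($2^8$) with reduction polynomial `0x11D`, a common choice in storage systems that may differ from the paper's. The function names `gf_mul`, `gf_inv`, and `cauchy_matrix` are hypothetical.

```python
def gf_mul(x: int, y: int, poly: int = 0x11D) -> int:
    """Multiply two elements of GF(2^8) (shift-and-add with reduction)."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        if x & 0x100:  # degree reached 8: reduce by the field polynomial
            x ^= poly
        y >>= 1
    return result


def gf_inv(x: int) -> int:
    """Invert a nonzero element of GF(2^8) using x^254 = x^(-1)."""
    if x == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result, base, exponent = 1, x, 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exponent >>= 1
    return result


def cauchy_matrix(a: list[int], b: list[int]) -> list[list[int]]:
    """Build the r x k matrix with entry [j][i] = 1 / (a_i + b_j).

    In characteristic 2, addition is XOR, so a_i + b_j is nonzero exactly
    when a_i and b_j are distinct.
    """
    if len(set(a + b)) != len(a) + len(b):
        raise ValueError("a_1..a_k and b_1..b_r must be distinct")
    if any(not 0 <= value <= 255 for value in a + b):
        raise ValueError("elements must lie in GF(2^8)")
    return [[gf_inv(ai ^ bj) for ai in a] for bj in b]


# Example parameters matching the (k=6, r=2) setting of Figure 1;
# the specific element values are arbitrary.
a_elems = [1, 2, 3, 4, 5, 6]
b_elems = [7, 8]
parity_matrix = cauchy_matrix(a_elems, b_elems)
```

Each row of `parity_matrix` would give the coefficients of one global parity block. Every square submatrix of a Cauchy matrix is invertible, which is the property that gives such a code its MDS fault tolerance.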