Making Wide Stripes Practical: Cascaded Parity LRCs for Efficient Repair and High Reliability
Fan Yu, Guodong Li, Si Wu, Weijun Fang, Sihuang Hu
TL;DR
CP-LRCs address critical limitations of wide-stripe LRCs by introducing a cascaded parity structure that couples local parity blocks with a global parity block, enabling efficient parity repair and improved multi-node repair performance while preserving MDS-level fault tolerance. The authors present a general coefficient-generation framework and instantiate it as CP-Azure and CP-Uniform, with a distributed prototype demonstrated on Alibaba Cloud showing up to 41% reduction in single-node repair time and 26% reduction for two-node failures, plus substantial degraded-read gains for small files. They provide theoretical analyses of repair bandwidth, local-repair participation, and MTTDL, and validate the approach through cloud experiments and real-world traces, releasing the implementation for public use. The work significantly enhances the practicality and reliability of wide-stripe erasure coding in large-scale storage systems by enabling cohesive parity cooperation during repair.
Abstract
Erasure coding with wide stripes is increasingly adopted to reduce storage overhead in large-scale storage systems. However, existing Locally Repairable Codes (LRCs) exhibit structural limitations in this setting: inflated local groups increase single-node repair cost, multi-node failures frequently trigger expensive global repair, and reliability degrades sharply. We identify a key root cause: local and global parity blocks are designed independently, which prevents them from cooperating during repair. We present Cascaded Parity LRCs (CP-LRCs), a new family of wide-stripe LRCs that embed a structured dependency between parity blocks by decomposing a global parity block across all local parity blocks. This creates a cascaded parity group that preserves MDS-level fault tolerance while enabling low-bandwidth single-node and multi-node repairs. We provide a general coefficient-generation framework, develop repair algorithms that exploit the cascading, and instantiate the design with CP-Azure and CP-Uniform. Evaluations on Alibaba Cloud show reductions in repair time of up to 41% for single-node failures and 26% for two-node failures.
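
To make the idea of a cascaded parity group concrete, the following is a toy sketch only. It assumes plain bytewise XOR in place of the finite-field coefficients that the paper's framework generates, and it is not CP-Azure or CP-Uniform. The function names (`xor_blocks`, `encode`, `repair_local_parity`) and the 12-block, 3-group layout are hypothetical choices made for illustration. The point it shows is that when a global parity block equals a combination of the local parity blocks, a lost local parity can be rebuilt from the other parity blocks without reading its group's data.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data, num_groups):
    """Split data blocks into equal local groups and compute parities.

    Toy assumption: each local parity is the XOR of its group, and the
    global parity is the XOR of all local parities, so the parities form
    one "cascaded" group. Real CP-LRCs use finite-field coefficients.
    """
    size = len(data) // num_groups
    groups = [data[i * size:(i + 1) * size] for i in range(num_groups)]
    local = [xor_blocks(group) for group in groups]
    global_parity = xor_blocks(local)
    return groups, local, global_parity


def repair_local_parity(lost, local, global_parity):
    """Rebuild local parity `lost` from the surviving parities only."""
    survivors = [p for i, p in enumerate(local) if i != lost]
    return xor_blocks(survivors + [global_parity])


# Hypothetical stripe: 12 data blocks of 4 bytes each, in 3 local groups.
data = [bytes([i] * 4) for i in range(1, 13)]
groups, local, global_parity = encode(data, num_groups=3)

# Repair local parity 1 by reading 3 parity blocks instead of 4 data blocks.
rebuilt = repair_local_parity(1, local, global_parity)
assert rebuilt == local[1]
```

In this toy layout the saving is one block read, but the gap grows with the local group size, which is the regime wide stripes operate in. The sketch says nothing about fault tolerance: a single XOR global parity does not give the MDS-level guarantees the abstract describes.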
