
Unraveling codes: fast, robust, beyond-bound error correction for DRAM

Mike Hamburg, Eric Linstadt, Danny Moore, Thomas Vogelsang

TL;DR

A new family of full-block generalized RS codes that combine the speed and beyond-bound correction capabilities of interleaved codes with the robustness of full-block codes, including the ability of the latter to reliably correct failures across multiple devices.

Abstract

Generalized Reed-Solomon (RS) codes are a common choice for efficient, reliable error correction in memory and communications systems. These codes add $2t$ extra parity symbols to a block of memory, and can efficiently and reliably correct up to $t$ symbol errors in that block. Decoding is possible beyond this bound, but it is imperfectly reliable and often computationally expensive. Beyond-bound decoding is an important problem to solve for error-correcting Dynamic Random Access Memory (DRAM). These memories are often designed so that each access touches two extra memory devices, so that a failure in any one device can be corrected. But system architectures increasingly require DRAM to store metadata in addition to user data. When the metadata replaces parity data, a single-device failure is then beyond-bound. An error-correction system can either protect each access with a single RS code, or divide it into several segments protected with a shorter code, usually in an Interleaved Reed-Solomon (IRS) configuration. The full-block RS approach is more reliable, both at correcting errors and at preventing silent data corruption (SDC). The IRS option is faster, and is especially efficient at beyond-bound correction of single- or double-device failures. Here we describe a new family of "unraveling" Reed-Solomon codes that bridges the gap between these options. Our codes are full-block generalized RS codes, but they can also be decoded using an IRS decoder. As a result, they combine the speed and beyond-bound correction capabilities of interleaved codes with the robustness of full-block codes, including the ability of the latter to reliably correct failures across multiple devices. We show that unraveling codes are an especially good fit for high-reliability DRAM error correction.
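The arithmetic behind "a single-device failure is then beyond-bound" can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes 8-bit symbols and 8 symbols per device per access (the "out of 8" figure quoted in the paper's figure captions), and it borrows the $\mathrm{RS}(2^8; 80,65)$ shape from the DDR5 example; the function names are hypothetical.

```python
# Worked arithmetic for the guaranteed correction bound of an RS code.
# Assumption: each device contributes 8 one-byte symbols per access,
# as suggested by the paper's figure captions.

SYMBOLS_PER_DEVICE = 8


def guaranteed_correctable(parity_symbols: int) -> int:
    """Symbol errors that 2t parity symbols are guaranteed to correct (t)."""
    return parity_symbols // 2


def device_failure_in_bound(parity_symbols: int) -> bool:
    """True if a whole-device failure fits inside the guaranteed bound."""
    return SYMBOLS_PER_DEVICE <= guaranteed_correctable(parity_symbols)


# Two extra devices used entirely for parity: 16 parity symbols.
full_parity = 2 * SYMBOLS_PER_DEVICE

# RS(2^8; 80, 65): one parity byte has been given up for metadata.
parity_with_metadata = 80 - 65

print(device_failure_in_bound(full_parity))           # True  (t = 8)
print(device_failure_in_bound(parity_with_metadata))  # False (t = 7)
```

With all 16 extra symbols used as parity, $t = 8$ covers the 8 symbols of one device. Giving a single byte to metadata drops $t$ to 7, so the same failure is beyond-bound.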

Paper Structure

This paper contains 27 sections, 3 theorems, 22 equations, and 5 figures.

Key Result

Theorem 1

The map $\mathrm{unravel}$ indeed maps $\mathcal{C} \to \mathcal{C}_0^{\ell-a} \times \mathcal{C}_1^a$.

Figures (5)

  • Figure 1: Interleaving with $\ell=3$. The $\alpha_i$ in a code position is the label of that symbol, not its value. A column error's locator must satisfy the key equation of all $\ell$ subcodes.
  • Figure 2: Two possible assignments of bits to RS symbols in DDR5. User data in blue, metadata in green, parity in red. Above, a cache line is encoded as a single $\mathrm{RS}(2^8; 80,65)$ codeword. Below, it instead uses $\ell=4$ interleaved codewords, the first three from $\mathrm{RS}(2^8; 20,16)$ and the last from $\mathrm{RS}(2^8; 20,17)$. The interleaved code is faster and has straightforward BBCC, but it is more vulnerable to uncorrectable errors or miscorrection. A codeword for the above arrangement can be unraveled to the below one, enabling BBCC without the increased risk of miscorrection. Since $\ell=4$, the unraveling map operates on groups of four symbols at a time, such as pairs of columns as highlighted in saturated blue. Perhaps surprisingly, this works even for the group with one metadata symbol and three parities.
  • Figure 3: Unraveling with $\ell=2$. The $\alpha$ or $\beta$ in the code position is the label of that symbol. The single $\mathrm{RS}(2n,2k)$ code can be decoded directly, reliably correcting up to $n-k$ errors in any pattern. Or, it can be converted into two interleaved $\mathrm{RS}(n,k)$ codes with the same rate. These can probabilistically decode up to $\lfloor\frac{4}{3}(n-k)\rfloor$ errors if they affect at most $\lfloor\frac{2}{3}(n-k)\rfloor$ columns, or slightly more with advanced decoding algorithms [puchinger2017decoding].
  • Figure 4: Concrete application to error-correcting $\times 4$ DDR5 with 512-bit user data, 16-bit metadata, and BBCC. The best or almost-best values are highlighted in green. All codes use 8-bit symbols unless otherwise stated. The IRS codes use the matching columns heuristic to reduce the miscorrection rate. The URS code is the best in all categories, except for miscorrection probability against dense errors. This exception arises only because the URS code can correct more error patterns than the IRS code.
  • Figure 5: High-reliability core parameters (all with $q=2^8$) and failure probabilities for different amounts of metadata. BBCC code: the code shape after unraveling for BBCC. With no metadata, this is unused because the 4DQ error correction can already reliably correct single-device errors. CC DUE: the proportion of single-device errors that are uncorrectable. Weight: the number of bytes on a single device (out of 8) that must be corrupted before the error can be uncorrectable. Random SDC: the proportion of multi-device errors (corrupting many symbols across many devices) that result in miscorrection and thus silent data corruption.
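The code parameters quoted in the captions of Figures 2 and 3 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the $\ell=2$ bounds are copied from the Figure 3 caption rather than derived, and the sample value $(n,k)=(20,14)$ is an assumption chosen for the example, not a configuration from the paper.

```python
# Parameter bookkeeping for the full-block and interleaved (unraveled) views.


def full_block_bound(n: int, k: int) -> int:
    """Errors an RS(n, k) code is guaranteed to correct: floor((n - k) / 2)."""
    return (n - k) // 2


def irs2_bounds(n: int, k: int) -> tuple[int, int]:
    """(max columns, max symbol errors) for two interleaved RS(n, k) codes,
    using the ell = 2 expressions quoted in the Figure 3 caption."""
    r = n - k
    return (2 * r) // 3, (4 * r) // 3


def interleaved_shape(subcodes: list[tuple[int, int]]) -> tuple[int, int]:
    """Total (length, dimension) of a list of RS(n_i, k_i) subcodes."""
    return sum(n for n, _ in subcodes), sum(k for _, k in subcodes)


# Figure 2: three RS(20, 16) codewords plus one RS(20, 17) codeword carry
# the same number of symbols and data symbols as a single RS(80, 65).
print(interleaved_shape([(20, 16)] * 3 + [(20, 17)]))  # (80, 65)

# Figure 3 with a hypothetical (n, k) = (20, 14):
n, k = 20, 14
print(full_block_bound(2 * n, 2 * k))  # 6, i.e. n - k errors in any pattern
print(irs2_bounds(n, k))               # (4, 8): up to 8 errors in 4 columns
```

The first check shows that unraveling preserves length and dimension in the DDR5 example. The second shows the trade-off from Figure 3: the full-block decoder guarantees 6 errors in any pattern, while the interleaved decoder can probabilistically reach 8 errors when they are confined to 4 columns.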

Theorems & Definitions (5)

  • Theorem 1
  • proof
  • Lemma 1
  • Corollary 1
  • proof