Silent Data Corruptions at Scale

Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar

TL;DR

This paper investigates Silent Data Corruptions (SDCs) at Facebook’s datacenter scale, showing that silicon defects can silently produce incorrect computations that propagate to application-level failures. It presents a real-world case study, a structured defect taxonomy, and a scalable debugging workflow that traces errors from Scala to assembly using cross-language reproducers and profiling tools. The authors discuss detection mechanisms and hardware- and software-based mitigations, including data-path protection, fleet testing strategies, redundancy, and fault-tolerant libraries. The work demonstrates that reducing SDC risk requires close hardware-software collaboration and scalable fault tolerance in production systems. The results emphasize that as datacenter hardware densifies, proactive design choices and robust software architectures are essential for reliable large-scale services.

Abstract

Silent Data Corruption (SDC) can have a negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that lead to SDCs. We discuss a real-world example of silent data corruption within a datacenter application. We provide the debug flow followed to root-cause and triage faulty instructions within a CPU using a case study, as an illustration of how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in hundreds of CPUs being detected with these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.

Paper Structure

This paper contains 27 sections, 5 equations, 3 figures.

Figures (3)

  • Figure 1: High Level Spark Architecture
  • Figure 2: Application level silent data corruption
  • Figure 3: High Level Debug Flow