Investigating Memory Failure Prediction Across CPU Architectures

Qiao Yu, Wengui Zhang, Min Zhou, Jialiang Yu, Zhenli Sheng, Jasmin Bogatinovski, Jorge Cardoso, Odej Kao

TL;DR

This paper addresses memory reliability in large-scale datacenters by examining how memory errors and their predictability differ across CPU architectures (Intel X86 Purley/Whitley and Huawei ARM K920). It employs architecture-aware machine learning models on production data to predict memory failures, achieving up to a 15% improvement in F1-score on the Purley platform and revealing distinct UE patterns across platforms. Key contributions include the first cross-architecture DRAM failure analysis, ML-based failure predictors tailored to different CPUs, and an MLOps framework to sustain production performance. The work has practical impact by enabling more accurate, timely mitigations of memory failures in heterogeneous datacenters, improving reliability and service continuity.

Abstract

Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) indicate critical malfunctions in Dual Inline Memory Modules (DIMMs). Existing approaches primarily use Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary across CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this paper, we investigate the correlation between CEs and UEs across different CPU architectures, including X86 and ARM. Our analysis identifies unique patterns of memory failure associated with each processor platform. Leveraging Machine Learning (ML) techniques on production datasets, we perform memory failure prediction on different processor platforms, achieving up to a 15% improvement in F1-score over the existing algorithm. Finally, we provide an MLOps (Machine Learning Operations) framework to consistently improve failure prediction in the production environment.

Paper Structure (12 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Memory Organization.
  • Figure 2: VM Interruption under Failure Prediction.
  • Figure 3: Failure prediction problem definition [huawei_2023_dsn].
  • Figure 4: Relative % of UE.
  • Figure 5: Analyses of Error Bits in Intel Platforms: Highlighting the Highest Rate with a Red Bar.
  • ...and 1 more figures