Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data

Priyanka Mudgal, Rita H. Wouhaybi

TL;DR

This work addresses reliable system failure detection using large-scale telemetry data collected from Intel ICIP-enabled PCs. It proposes an unsupervised ensemble approach in which a two-layer LSTM encoder learns latent representations of time-series telemetry; these representations are then fed into three outlier detectors (Isolation Forest, OCSVM, LOF) trained only on normal data. Experiments show that the LSTM + Isolation Forest combination yields the strongest detection performance with favorable inference times, while remaining robust to data imbalance. The results underscore the practical value of structured telemetry preprocessing and of combining deep learning with classical anomaly detection to improve PC reliability and user experience.

Abstract

The growing reliance on computer systems, particularly personal computers (PCs), necessitates heightened reliability to uphold user satisfaction. This research paper presents an in-depth analysis of extensive system telemetry data and proposes an ensemble methodology for detecting system failures. Our approach examines system metrics, including CPU utilization, memory utilization, disk activity, and CPU temperature, together with pertinent system metadata such as system age, usage patterns, core count, and processor type. The proposed ensemble technique integrates a diverse set of algorithms, namely Long Short-Term Memory (LSTM) networks, isolation forests, one-class support vector machines (OCSVM), and local outlier factors (LOF), to detect system failures. Specifically, the LSTM network, combined with these machine learning techniques, is trained on data from the Intel Computing Improvement Program (ICIP) telemetry software to distinguish between normal and failed system patterns. Experimental evaluations show that our models identify system failures effectively. Our research contributes to advancing the field of system reliability and offers practical insights for enhancing user experience in computing environments.
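
The pipeline described above (encode telemetry windows, then fit one-class detectors on normal data only) can be illustrated with a minimal sketch. It is not the authors' implementation. Everything here is an assumption made for illustration: the synthetic data, the window shape, the detector hyperparameters, and above all the `encode` function, which replaces the LSTM encoder with simple per-window mean and standard deviation features so that the example stays short. The feature scaling step is also an addition of this sketch. Only the three scikit-learn detectors correspond to the methods named in the abstract.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)


def encode(windows):
    """Hypothetical stand-in for the LSTM encoder (assumption, not the paper's model).

    Maps windows of shape (n_windows, n_steps, n_metrics) to latent vectors of
    shape (n_windows, 2 * n_metrics) using per-metric mean and standard deviation.
    """
    return np.concatenate([windows.mean(axis=1), windows.std(axis=1)], axis=1)


# Synthetic telemetry (assumption): 4 metrics per step, e.g. CPU utilization,
# memory utilization, disk activity, CPU temperature, all scaled to roughly [0, 1].
normal_train = rng.normal(0.5, 0.05, size=(500, 30, 4))
normal_test = rng.normal(0.5, 0.05, size=(50, 30, 4))
failed_test = rng.normal(0.9, 0.20, size=(50, 30, 4))

# One-class detectors; hyperparameters are illustrative, not taken from the paper.
# LocalOutlierFactor needs novelty=True to score unseen data via predict().
detectors = {
    "isolation_forest": make_pipeline(
        StandardScaler(), IsolationForest(n_estimators=100, random_state=0)
    ),
    "ocsvm": make_pipeline(
        StandardScaler(), OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
    ),
    "lof": make_pipeline(
        StandardScaler(), LocalOutlierFactor(n_neighbors=20, novelty=True)
    ),
}

# Fit every detector on latent vectors of normal windows only.
z_train = encode(normal_train)
for detector in detectors.values():
    detector.fit(z_train)


def flag_rate(windows):
    """Fraction of windows each detector labels as outliers (predict() == -1)."""
    z = encode(windows)
    return {
        name: float(np.mean(detector.predict(z) == -1))
        for name, detector in detectors.items()
    }


normal_rates = flag_rate(normal_test)
failed_rates = flag_rate(failed_test)

for name in detectors:
    print(f"{name}: flagged normal={normal_rates[name]:.2f}, "
          f"failed={failed_rates[name]:.2f}")
```

Because the detectors never see failed windows during fitting, the sketch mirrors the "trained only on normal data" setup from the TL;DR. Swapping `encode` for a learned sequence encoder would leave the rest of the code unchanged, since the detectors only consume fixed-length latent vectors.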

Paper Structure (21 sections, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: Platform Stack and examples of measures
  • Figure 2: Disk utilization (a and b) and CPU utilization (c and d) examples from the ICIP 7936346 dataset
  • Figure 3: Proposed ensemble architecture. An LSTM encodes the telemetry data, and the encoded output is then fed to three different machine learning models: isolation forest, one-class support vector machine, and local outlier factor.