SCANIA Component X Dataset: A Real-World Multivariate Time Series Dataset for Predictive Maintenance

Zahra Kharazian, Tony Lindgren, Sindri Magnússon, Olof Steinert, Oskar Andersson Reyna

TL;DR

The paper releases the SCANIA Component X dataset, a real-world, anonymized multivariate time-series collection designed for predictive maintenance (PdM) research. It combines operational readouts, repair (time-to-event) records, and anonymized truck specifications from a large SCANIA fleet, enabling tasks such as classification, regression, survival analysis, and anomaly detection. The dataset includes 1.12 million readouts across 23,550 vehicles with 14 anonymized features organized as histograms and counters, plus a time-to-event label and separate specifications. It is partitioned into training, validation, and testing sets to support reproducible benchmarking. The release also provides a privacy-preserving protocol (relative times, anonymized IDs, data perturbations) and a defined cost-based evaluation framework, making it a practical benchmark for industry-academic collaboration and a reference for PdM research in real-world settings.

Abstract

Predicting failures and maintenance time in predictive maintenance is challenging due to the scarcity of comprehensive real-world datasets, and among those available, few are of time series format. This paper introduces a real-world, multivariate time series dataset collected exclusively from a single anonymized engine component (Component X) across a fleet of SCANIA trucks. The dataset includes operational data, repair records, and specifications related to Component X, while maintaining confidentiality through anonymization. It is well-suited for a range of machine learning applications, including classification, regression, survival analysis, and anomaly detection, particularly in predictive maintenance scenarios. The dataset's large population size, diverse features (in the form of histograms and numerical counters), and temporal information make it a unique resource in the field. The objective of releasing this dataset is to give a broad range of researchers the possibility of working with real-world data from an internationally well-known company and introduce a standard benchmark to the predictive maintenance field, fostering reproducible research.

Paper Structure (22 sections, 1 equation, 8 figures, 1 table)

Figures (8)

  • Figure 1:
  • Figure 4:
  • Figure 7: The percentage of missing values in the train_operational_readouts.csv file is less than 1% for every feature column. Note that the y-axis shows the percentage of missing values and is capped at 1.5 for better visualization.
  • Figure 8:
  • Figure 11: History of readouts for ten random vehicles.
  • ...and 3 more figures
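
The per-column missing-value percentage summarized in the Figure 7 caption can be reproduced with a few lines of standard-library Python. The sketch below is only illustrative: it assumes that missing values appear as empty cells in the CSV, and the column names in the in-memory sample (vehicle_id, time_step, feature_a, feature_b) are hypothetical placeholders rather than the dataset's actual headers.

```python
import csv
import io


def missing_percentage(csv_file):
    """Return {column: percentage of rows whose cell is empty} for a CSV file object."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    total = 0
    for row in reader:
        total += 1
        for name in columns:
            value = row.get(name)
            # Assumption: a missing value is an empty (or absent) cell.
            if value is None or value.strip() == "":
                missing[name] += 1
    if total == 0:
        return {name: 0.0 for name in columns}
    return {name: 100.0 * count / total for name, count in missing.items()}


# Tiny in-memory stand-in for the real file; column names are hypothetical.
sample = io.StringIO(
    "vehicle_id,time_step,feature_a,feature_b\n"
    "1,0.0,10,5\n"
    "1,1.0,,6\n"
    "2,0.0,12,\n"
    "2,2.0,13,8\n"
)
result = missing_percentage(sample)
for column, percent in result.items():
    print(f"{column}: {percent:.2f}% missing")
```

To apply the same function to the released data, one would pass an open file handle instead of the sample, for example `open("train_operational_readouts.csv", newline="")`, and check whether the dataset encodes missing values as empty cells or with some other marker before trusting the percentages.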