Towards a more realistic evaluation of machine learning models for bearing fault diagnosis

João Paulo Vieira, Victor Afonso Bauler, Rodrigo Kobashikawa Rosa, Danilo Silva

TL;DR

This paper tackles the pervasive problem of data leakage in vibration-based bearing fault diagnosis and argues that traditional data splits inflate performance estimates. It proposes a leakage-free evaluation framework using bearing-wise data partitioning, reformulates fault diagnosis as a multi-label binary task evaluated with Macro AUROC, and employs a Double Cross-Validation scheme to reduce bias. Through experiments on CWRU, PU, and UORED-VAFCLS, it shows that training data diversity, particularly the number of unique bearings, strongly influences generalization, and that deep models are not always superior to handcrafted features. The work further demonstrates the substantial impact of leakage by comparing leakage-free results with common leakage-prone splits and provides practical guidelines for dataset design, model selection, and reproducible evaluation to advance robust industrial fault diagnosis systems.

Abstract

Reliable detection of bearing faults is essential for maintaining the safety and operational efficiency of rotating machinery. While recent advances in machine learning (ML), particularly deep learning, have shown strong performance in controlled settings, many studies fail to generalize to real-world applications due to methodological flaws, most notably data leakage. This paper investigates the issue of data leakage in vibration-based bearing fault diagnosis and its impact on model evaluation. We demonstrate that common dataset partitioning strategies, such as segment-wise and condition-wise splits, introduce spurious correlations that inflate performance metrics. To address this, we propose a rigorous, leakage-free evaluation methodology centered on bearing-wise data partitioning, ensuring no overlap between the physical components used for training and testing. Additionally, we reformulate the classification task as a multi-label problem, enabling the detection of co-occurring fault types and the use of prevalence-independent metrics such as Macro AUROC. Beyond preventing leakage, we also examine the effect of dataset diversity on generalization, showing that the number of unique training bearings is a decisive factor for achieving robust performance. We evaluate our methodology on three widely adopted datasets: CWRU, Paderborn University (PU), and University of Ottawa (UORED-VAFCLS). This study highlights the importance of leakage-aware evaluation protocols and provides practical guidelines for dataset partitioning, model selection, and validation, fostering the development of more trustworthy ML systems for industrial fault diagnosis applications.
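
As a rough illustration of the multi-label formulation and the Macro AUROC metric mentioned above, the following sketch scores each fault type as its own binary problem and averages the per-label AUROCs without weighting. The label layout (inner-race fault, outer-race fault), the toy scores, and the function names `binary_auroc` and `macro_auroc` are assumptions made for this example; the pairwise (Mann-Whitney) computation is a standard way to obtain AUROC and is not necessarily how the authors implemented it.

```python
from typing import List, Sequence


def binary_auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney U formulation).
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(Y_true: List[List[int]], Y_score: List[List[float]]) -> float:
    """Unweighted mean of the per-label AUROCs of a multi-label problem."""
    n_labels = len(Y_true[0])
    per_label = []
    for j in range(n_labels):
        col_true = [row[j] for row in Y_true]
        col_score = [row[j] for row in Y_score]
        per_label.append(binary_auroc(col_true, col_score))
    return sum(per_label) / n_labels


# Hypothetical toy data: columns are (inner-race fault, outer-race fault).
# A healthy sample is [0, 0]; co-occurring faults are [1, 1].
Y_true = [[0, 0], [1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
Y_score = [[0.1, 0.2], [0.8, 0.3], [0.3, 0.9], [0.7, 0.6], [0.4, 0.1], [0.35, 0.7]]

score = macro_auroc(Y_true, Y_score)
print(f"Macro AUROC: {score:.4f}")
```

Because every label contributes equally to the mean, a rare fault type counts as much as a common one, and a sample with co-occurring faults simply has more than one positive label.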

Paper Structure

This paper contains 33 sections, 15 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Comparison of Decision Tree (DT) and Logistic Regression (LR) accuracy across varying numbers of training bearings, evaluated under two conditions: a leakage-free test (Valid) set and a test set with data leakage (Leakage).
  • Figure 2: Exemplary bearing-level data partitioning for the generic dataset. The training set (green) and test set (blue) are disjoint at the bearing level, with a 3:2 allocation of bearings per health state.
  • Figure 3: Specification of the generic bearing fault dataset, comprising 15 unique bearings, two fault modes (inner, outer), and two distinct acquisition configurations per bearing.
  • Figure 4: Schematic of the Double Cross-Validation (CVM-CV) protocol applied to the UORED-VAFCLS dataset. A distinct set of 5 bearing-level splits is used for hyperparameter tuning, while a separate set of 100 splits is used for final performance evaluation.
  • Figure 5: Schematic of the Double Cross-Validation (CVM-CV) protocol applied to the PU dataset.
  • ...and 8 more figures
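
The bearing-level partitioning of Figure 2 and the two separate split sets of Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes the generic 15-bearing dataset has three health states (healthy, inner, outer) with five bearings each, uses hypothetical bearing IDs and configuration names, and draws splits at random with a duplicate check so that no split used for tuning is reused for evaluation.

```python
import math
import random
from typing import Dict, FrozenSet, Iterable, List, Tuple

# A split is (train bearing IDs, test bearing IDs).
Split = Tuple[FrozenSet[str], FrozenSet[str]]


def bearing_level_split(
    bearings_by_state: Dict[str, List[str]], n_train: int, rng: random.Random
) -> Split:
    """Assign whole bearings to train or test, n_train per health state."""
    train, test = set(), set()
    for state in sorted(bearings_by_state):
        ids = sorted(bearings_by_state[state])
        chosen = set(rng.sample(ids, n_train))
        train |= chosen
        test |= set(ids) - chosen
    return frozenset(train), frozenset(test)


def draw_distinct_splits(
    bearings_by_state: Dict[str, List[str]],
    n_train: int,
    n_splits: int,
    seed: int,
    exclude: Iterable[Split] = (),
) -> List[Split]:
    """Draw n_splits distinct splits, none of which appears in `exclude`."""
    seen = set(exclude)
    total = math.prod(math.comb(len(ids), n_train) for ids in bearings_by_state.values())
    if n_splits + len(seen) > total:
        raise ValueError("not enough distinct bearing-level splits available")
    rng = random.Random(seed)
    splits: List[Split] = []
    while len(splits) < n_splits:
        split = bearing_level_split(bearings_by_state, n_train, rng)
        if split not in seen:
            seen.add(split)
            splits.append(split)
    return splits


# Assumed layout: 15 bearings, 5 per health state (hypothetical IDs).
bearings_by_state = {
    "healthy": [f"H{i}" for i in range(1, 6)],
    "inner": [f"I{i}" for i in range(1, 6)],
    "outer": [f"O{i}" for i in range(1, 6)],
}

# 3:2 train/test allocation per health state, as in Figure 2.
tuning = draw_distinct_splits(bearings_by_state, n_train=3, n_splits=5, seed=0)
evaluation = draw_distinct_splits(
    bearings_by_state, n_train=3, n_splits=100, seed=1, exclude=tuning
)

# Every recording follows its bearing: two acquisition configurations per
# bearing (Figure 3), here with placeholder configuration names.
recordings = [
    (bearing, config)
    for ids in bearings_by_state.values()
    for bearing in ids
    for config in ("config_A", "config_B")
]
train_ids, test_ids = evaluation[0]
train_recordings = [r for r in recordings if r[0] in train_ids]
test_recordings = [r for r in recordings if r[0] in test_ids]
print(len(train_recordings), len(test_recordings))
```

Splitting on bearing IDs first and only then collecting recordings (or segments cut from them) is what keeps a physical bearing from appearing on both sides of a split; splitting the segments directly would not give that guarantee.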