Too Good To Be True: performance overestimation in (re)current practices for Human Activity Recognition
Andrés Tello, Victoria Degeler, Alexander Lazovik
TL;DR
The paper addresses a pervasive bias in Human Activity Recognition evaluations: sliding-window segmentation coupled with random train/test splits inflates reported accuracies by violating the independence assumption. It combines a literature review to show how widespread this flawed practice is with controlled experiments on MILAN, PAMAP2, and MHEALTH using RF and GNN models to quantify the bias. The authors demonstrate substantial performance drops when using unbiased group-based evaluations, highlighting the need for LOSO or group-wise CV to obtain realistic assessments of HAR systems. The work advocates adopting unbiased evaluation strategies to ensure fair benchmarking and reliable generalization claims in HAR research, with practical implications for reporting and comparing models.
Abstract
Today, there are standard and well-established procedures within the Human Activity Recognition (HAR) pipeline. However, some of these conventional approaches lead to accuracy overestimation. In particular, sliding windows for data segmentation followed by standard random k-fold cross-validation produce biased results. An analysis of previous literature and present-day studies shows, surprisingly, that these are common approaches in state-of-the-art studies on HAR. It is important to raise awareness in the scientific community about this problem, whose negative effects are being overlooked. Otherwise, the publication of biased results makes papers that report lower accuracies, obtained with correct unbiased methods, harder to publish. Several experiments with different types of datasets and different types of classification models allow us to exhibit the problem and show that it persists independently of the method or dataset.
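The source of the bias can be illustrated with a small sketch. This is not the authors' experimental code: the toy recordings, the 50% window overlap, and the `leakage` measure (the fraction of test windows that share raw samples with at least one training window) are all assumptions made for illustration. The sketch only shows why a random split over overlapping windows breaks train/test independence, while a leave-one-subject-out (LOSO) split does not.

```python
import random

def sliding_windows(n_samples, width, step):
    """Return (start, end) index pairs of windows over one recording."""
    return [(s, s + width) for s in range(0, n_samples - width + 1, step)]

def build_dataset(n_subjects=4, n_samples=100, width=20, step=10):
    """Hypothetical dataset: (subject_id, start, end) windows with 50% overlap."""
    return [(subj, s, e)
            for subj in range(n_subjects)
            for (s, e) in sliding_windows(n_samples, width, step)]

def overlaps(a, b):
    """True if two windows belong to the same subject and share raw samples."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def leakage(train, test):
    """Fraction of test windows sharing raw samples with some training window."""
    leaked = sum(any(overlaps(t, tr) for tr in train) for t in test)
    return leaked / len(test)

def random_split(windows, n_test=10, seed=0):
    """Shuffle all windows, ignoring which subject they come from."""
    shuffled = windows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

def leave_one_subject_out(windows, held_out):
    """Put every window of one subject in the test set, the rest in training."""
    train = [w for w in windows if w[0] != held_out]
    test = [w for w in windows if w[0] == held_out]
    return train, test

windows = build_dataset()
rand_train, rand_test = random_split(windows)
loso_train, loso_test = leave_one_subject_out(windows, held_out=0)

print("random split leakage:", leakage(rand_train, rand_test))
print("LOSO leakage:", leakage(loso_train, loso_test))
```

With a random split, some test windows overlap with training windows of the same subject, so the model is partly evaluated on data it has already seen. With LOSO, the leakage is zero by construction. In practice, a group-aware splitter such as scikit-learn's `GroupKFold` or `LeaveOneGroupOut`, with the subject identifier passed as the group, would play the role of the hand-written split here.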
