Too Good To Be True: performance overestimation in (re)current practices for Human Activity Recognition

Andrés Tello; Victoria Degeler; Alexander Lazovik

TL;DR

The paper addresses a pervasive bias in Human Activity Recognition (HAR) evaluation: sliding-window segmentation combined with random train/test splits inflates reported accuracy, because windows from the same subject or recording land on both sides of the split and the test set is no longer independent of the training set. It combines a literature review, which shows how widespread this practice is, with controlled experiments on the MILAN, PAMAP2, and MHEALTH datasets using Random Forest (RF) and Graph Neural Network (GNN) models to quantify the bias. The authors demonstrate substantial performance drops under unbiased, group-based evaluation and argue for leave-one-subject-out (LOSO) or group-wise cross-validation (CV) to obtain realistic assessments, fair benchmarking, and reliable generalization claims in HAR research.
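The difference between a random split and a group-based split can be illustrated with a minimal sketch. It is not the authors' code: the toy data (4 subjects with 50 windows each) and the helper names `leave_one_subject_out`, `random_split`, and `shared_subjects` are hypothetical, and it uses only the Python standard library.

```python
import random


def leave_one_subject_out(subjects):
    """Yield (held_out, train_idx, test_idx), holding out one subject per fold."""
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        yield held_out, train, test


def random_split(n_items, test_fraction, seed):
    """Shuffle all window indices and cut off a test set, ignoring subjects."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_test = int(n_items * test_fraction)
    return idx[n_test:], idx[:n_test]


def shared_subjects(subjects, train, test):
    """Return the subjects that contribute windows to both train and test."""
    return {subjects[i] for i in train} & {subjects[i] for i in test}


# Hypothetical toy data: subject id of each window (4 subjects, 50 windows each).
subjects = [s for s in range(4) for _ in range(50)]

# Group-based evaluation: no subject is ever on both sides of a fold.
loso_shared = [
    shared_subjects(subjects, train, test)
    for _, train, test in leave_one_subject_out(subjects)
]

# Random split: windows of the same subject end up in both train and test.
rand_train, rand_test = random_split(len(subjects), 0.25, seed=0)
rand_shared = shared_subjects(subjects, rand_train, rand_test)

print("LOSO shared subjects per fold:", loso_shared)
print("Random split shared subjects:", sorted(rand_shared))
```

In the group-based folds the set of shared subjects is always empty, whereas the random split mixes windows of the same subjects into both partitions, which is the condition under which the paper reports overestimated performance.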

Abstract

Today, there are standard, well-established procedures within the Human Activity Recognition (HAR) pipeline. However, some of these conventional approaches lead to accuracy overestimation. In particular, sliding windows for data segmentation followed by standard random k-fold cross-validation produce biased results. An analysis of previous literature and present-day studies shows, surprisingly, that these are common approaches in state-of-the-art studies on HAR. It is important to raise awareness in the scientific community about this problem, whose negative effects are being overlooked. Otherwise, the publication of biased results makes papers that report lower accuracies, obtained with correct, unbiased methods, harder to publish. Several experiments with different types of datasets and classification models allow us to exhibit the problem and show that it persists independently of the method or dataset.
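One way to see why sliding windows and random splits interact badly is to count how many test windows literally share raw samples with a training window. The sketch below is an illustration under stated assumptions, not the paper's experimental setup: the signal length (1000 samples), the window size (100), the step sizes, the chosen test window ids, and the helper names `sliding_windows` and `leaked_test_windows` are all hypothetical.

```python
def sliding_windows(n_samples, window_size, step):
    """Return (start, end) index pairs; step < window_size gives overlapping windows."""
    return [
        (start, start + window_size)
        for start in range(0, n_samples - window_size + 1, step)
    ]


def leaked_test_windows(windows, test_ids):
    """Count test windows that share at least one raw sample with a training window."""
    test_ids = set(test_ids)
    train = [w for i, w in enumerate(windows) if i not in test_ids]
    leaked = 0
    for i in sorted(test_ids):
        start, end = windows[i]
        # Two half-open intervals [a, b) and [c, d) intersect iff c < b and a < d.
        if any(t_start < end and start < t_end for t_start, t_end in train):
            leaked += 1
    return leaked


# Hypothetical recording of 1000 samples, windows of 100 samples.
overlapping = sliding_windows(1000, window_size=100, step=50)   # 50% overlap
disjoint = sliding_windows(1000, window_size=100, step=100)     # no overlap

# Assumed test window ids, standing in for a random selection.
print("overlapping:", leaked_test_windows(overlapping, [2, 7, 12, 17]))
print("disjoint:", leaked_test_windows(disjoint, [2, 7]))
```

This check only captures literal sample sharing between overlapping windows. It does not measure the correlation that can remain between adjacent non-overlapping windows of the same recording or subject, which is why a group-based split is the safer choice.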

Paper Structure (19 sections, 10 figures, 2 tables)

Introduction
Model performance overestimation
(Re)current practices in HAR
Unbiased model evaluation
Datasets
MILAN
PAMAP2
MHEALTH
Data segmentation and feature extraction
MILAN
PAMAP2
MHEALTH
Classification models and evaluation strategies
MILAN
PAMAP2 and MHEALTH
...and 4 more sections

Figures (10)

Figure 1: Overlapping and non-overlapping sliding windows data segmentation
Figure 2: Reported classification accuracy from Altun et al. [altun2010comparative] on UCI Daily and Sport Activities dataset.
Figure 3: 5-fold CV vs LOSO: reported accuracy comparison from Micucci et al. [micucci2017unimib]
Figure 4: Reported F1-Score of a RF classifier from San-Segundo et al. [san2018robust]. (a): different normalization techniques. (b): feature extraction techniques after z-score normalization.
Figure 5: Reported results from Mutegeki and Han [mutegeki2020cnn].
...and 5 more figures
