Beyond Confusion: A Fine-grained Dialectical Examination of Human Activity Recognition Benchmark Datasets
Daniel Geissler, Dominique Nshimyimana, Vitor Fortes Rey, Sungho Suh, Bo Zhou, Paul Lukowicz
TL;DR
The paper interrogates HAR benchmark datasets beyond traditional metrics by identifying segments that no state-of-the-art model classifies correctly, termed the intersect of false classifications (IFC). It trains a suite of milestone models on six IMU-based HAR datasets, then quantifies the misclassifications shared across all models and attributes them to annotation ambiguity, transition periods, and data quality issues. A key contribution is a trinary IFC mask (clean, minor, major) that tags IFC regions for dataset patching or training guidance, making benchmarks more interpretable and auditable. The work prioritizes data-centric improvements (better labeling practices, transition handling, and acquisition guidelines) over chasing marginal gains in model performance, with practical implications for future HAR data collection and evaluation. Overall, it argues that granular data analysis is essential for robust HAR systems and proposes concrete tools to mitigate dataset ambiguities in real-world deployments.
Abstract
Research on machine learning (ML) algorithms for human activity recognition (HAR) has made significant progress with publicly available datasets. However, most research prioritizes statistical metrics over a detailed examination of negative samples. While recent models such as transformers have been applied to HAR datasets with limited success according to benchmark metrics, their counterparts have effectively solved problems at a similar level with near 100% accuracy. This raises questions about the limitations of current approaches. This paper aims to address these open questions by conducting a fine-grained inspection of six popular HAR benchmark datasets. We identified parts of the data that none of the six chosen state-of-the-art ML methods can classify correctly; we denote these parts as the intersect of false classifications (IFC). Analysis of the IFC reveals several underlying problems, including ambiguous annotations, irregularities during recording execution, and misaligned transition periods. We contribute to the field by quantifying and characterizing annotated data ambiguities, providing a trinary categorization mask for dataset patching, and stressing potential improvements for future data collections.
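To make the IFC definition concrete, the following is a minimal sketch of how such an intersection could be computed from per-sample predictions. It is not the authors' implementation: the function name `intersect_of_false_classifications`, the per-sample (rather than per-window) granularity, and the toy labels are all assumptions made for illustration. The rule for splitting IFC regions into the minor and major categories is not stated here, so the sketch stops at the binary mask.

```python
def intersect_of_false_classifications(y_true, predictions):
    """Return a list of booleans, True where every model misclassifies.

    y_true:      ground-truth labels, one per sample.
    predictions: one sequence of predicted labels per model,
                 each aligned with y_true.
    (Hypothetical helper for illustration; names are assumptions.)
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    for preds in predictions:
        if len(preds) != len(y_true):
            raise ValueError("predictions must align with y_true")

    # A sample belongs to the IFC only if no model predicts its label.
    return [
        all(preds[i] != label for preds in predictions)
        for i, label in enumerate(y_true)
    ]


# Toy example: three hypothetical models, six samples.
y_true = [0, 0, 1, 1, 2, 2]
predictions = [
    [0, 1, 1, 2, 2, 0],  # model A
    [0, 1, 1, 1, 2, 0],  # model B
    [0, 2, 1, 2, 2, 1],  # model C
]

ifc_mask = intersect_of_false_classifications(y_true, predictions)
ifc_ratio = sum(ifc_mask) / len(ifc_mask)  # share of samples in the IFC
```

In this toy data, the fourth sample is misclassified by two of the three models but is not part of the IFC, because one model still labels it correctly. Only samples that every model gets wrong are flagged, which is what separates the IFC from an ordinary per-model error count.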
