Beyond Confusion: A Fine-grained Dialectical Examination of Human Activity Recognition Benchmark Datasets

Daniel Geissler, Dominique Nshimyimana, Vitor Fortes Rey, Sungho Suh, Bo Zhou, Paul Lukowicz

TL;DR

The paper examines HAR benchmark datasets beyond traditional metrics by identifying the segments that none of the evaluated state-of-the-art models classifies correctly, called the intersect of false classifications (IFC). It trains a suite of milestone models across six IMU-based HAR datasets, then quantifies the misclassifications shared across models and attributes them to annotation ambiguity, transition periods, and data quality issues. A key contribution is the trinary IFC mask (clean, minor, major), which tags IFC regions for dataset patching or training guidance and makes benchmarks more interpretable and auditable. The work emphasizes data-centric improvements (better labeling practices, transition handling, and acquisition guidelines) over chasing marginal gains in model performance, with practical implications for future HAR data collection and evaluation. Overall, it argues that granular data analysis is essential for robust HAR systems and proposes concrete tools to mitigate dataset ambiguities in real-world deployments.

Abstract

Research on machine learning (ML) algorithms for human activity recognition (HAR) has made significant progress with publicly available datasets. However, most research prioritizes statistical metrics over examining negative sample details. While recent models like transformers have been applied to HAR datasets with limited success on the benchmark metrics, their counterparts have effectively solved problems on similar levels with near 100% accuracy. This raises questions about the limitations of current approaches. This paper aims to address these open questions by conducting a fine-grained inspection of six popular HAR benchmark datasets. We found that, for some parts of the data, none of the six chosen state-of-the-art ML methods classifies correctly; we denote these parts as the intersect of false classifications (IFC). Analysis of the IFC reveals several underlying problems, including ambiguous annotations, irregularities during recording execution, and misaligned transition periods. We contribute to the field by quantifying and characterizing annotated data ambiguities, providing a trinary categorization mask for dataset patching, and stressing potential improvements for future data collections.

Paper Structure

This paper contains 22 sections, 1 equation, 20 figures, 6 tables.

Figures (20)

  • Figure 1: We aim to examine HAR research beyond the typical approach focusing on the statistical metrics and confusion matrices tested on public benchmark datasets to demonstrate improvement.
  • Figure 2: Extracting the Overlap of False Classification across the models, then merging the overlapping sliding windows into the 1D sequence for the Intersect of False Classification (IFC).
  • Figure 3: Chord diagrams of PAMAP2, with null class added on the left and null class removed on the right for clarity.
  • Figure 4: Chord diagrams and table of class distribution and confusion in Opportunity with locomotion labels.
  • Figure 5: Chord diagrams and table of class distribution and confusion in Opportunity with gesture labels.
  • ...and 15 more figures
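
The procedure named in the Figure 2 caption (find the windows that every model misclassifies, then merge the overlapping sliding windows back into a 1D sequence) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the per-window label format, and the merge rule (a sample is flagged if any flagged window covers it) are assumptions made for illustration, and the sketch does not attempt the trinary clean/minor/major categorization, whose criteria are not given here.

```python
import numpy as np


def intersect_of_false_classifications(y_true, preds_by_model):
    """Return a boolean mask per window: True where every model is wrong.

    y_true: per-window ground-truth labels, shape (n_windows,)
    preds_by_model: list of per-window predictions, one array per model
    """
    y_true = np.asarray(y_true)
    # One row per model, True where that model misclassifies the window.
    wrong = np.stack([np.asarray(p) != y_true for p in preds_by_model])
    # The intersect keeps only windows that all models get wrong.
    return wrong.all(axis=0)


def windows_to_sequence(window_mask, window_len, stride, n_samples):
    """Project per-window flags onto a per-sample 1D mask.

    Assumption (hypothetical merge rule): a sample is flagged if at least
    one flagged window covers it.
    """
    seq = np.zeros(n_samples, dtype=bool)
    for i, flagged in enumerate(window_mask):
        if flagged:
            start = i * stride
            seq[start:start + window_len] = True
    return seq


# Toy example: 5 windows, 3 models, window length 4, stride 2.
y_true = [0, 0, 1, 1, 2]
preds = [
    [0, 1, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [1, 1, 1, 0, 2],
]
ifc_windows = intersect_of_false_classifications(y_true, preds)
ifc_sequence = windows_to_sequence(ifc_windows, window_len=4, stride=2, n_samples=12)
```

In the toy example only the second and fourth windows are misclassified by all three models, so only the samples they cover are flagged. A stricter merge rule, such as requiring every covering window to be flagged, would give a smaller mask; which rule matches the paper is not stated in this summary.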