Weak-Annotation of HAR Datasets using Vision Foundation Models

Marius Bock, Kristof Van Laerhoven, Michael Moeller

TL;DR

The paper addresses the labeling burden and limited scale of HAR datasets by introducing a clustering-based weak annotation pipeline that leverages vision foundation models to generate latent embeddings from video clips. Per-participant Gaussian Mixture Model clustering identifies centroid clips, which are then manually labeled and propagated to all cluster members, with distance-based outlier filtering. The weakly labeled data are used to train inertial-based deep classifiers with a PHGCE loss to handle label noise, achieving results close to fully supervised baselines on WEAR, ActionSense, and Wetlab, particularly when using a CLIP plus optical-flow embedding combination. This approach reduces annotation effort, preserves label quality, and can expand HAR benchmarks by incorporating richer multimodal representations learned by foundation models.
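The label-propagation step described above can be illustrated with a minimal sketch. It assumes cluster assignments have already been produced per participant (for example by fitting a Gaussian Mixture Model on the clip embeddings), and it uses Euclidean distance to the cluster mean for both centroid selection and outlier filtering. The function name `propagate_labels`, the `annotate` callback, and the `max_dist` threshold are hypothetical and not taken from the authors' code.

```python
import math
from collections import defaultdict


def propagate_labels(embeddings, cluster_ids, annotate, max_dist=None):
    """Label every clip with the human label of its cluster's centroid clip.

    embeddings:  list of equal-length float vectors, one per clip
    cluster_ids: list of cluster indices, one per clip (assumed GMM output)
    annotate:    callable clip_index -> label, standing in for the annotator
    max_dist:    hypothetical threshold; clips farther than this from their
                 cluster mean stay unlabelled (None)
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    labels = [None] * len(embeddings)
    queried = []  # clips a human actually had to look at
    for cid, idxs in members.items():
        dim = len(embeddings[idxs[0]])
        mean = [sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        dists = {i: math.dist(embeddings[i], mean) for i in idxs}

        # The centroid clip is the member closest to the cluster mean.
        centroid_clip = min(idxs, key=dists.get)
        label = annotate(centroid_clip)
        queried.append(centroid_clip)

        # Propagate the label, skipping members that lie too far away.
        for i in idxs:
            if max_dist is None or dists[i] <= max_dist:
                labels[i] = label
    return labels, queried


# Toy data: 2-D "embeddings" for seven clips in two clusters.
embeddings = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1], [3.0, 0.0],
              [5.0, 5.0], [5.1, 5.0], [5.3, 5.0]]
cluster_ids = [0, 0, 0, 0, 1, 1, 1]
true_labels = ["walk", "walk", "walk", "run", "sit", "sit", "sit"]

labels, queried = propagate_labels(
    embeddings, cluster_ids, annotate=lambda i: true_labels[i], max_dist=1.0
)
```

In this toy run only two of the seven clips are annotated by hand, and the clip that sits far from its cluster mean is left unlabelled rather than receiving a likely wrong label.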

Abstract

As wearable-based data annotation remains, to date, a tedious, time-consuming task requiring researchers to dedicate substantial time, benchmark datasets within the field of Human Activity Recognition (HAR) lack richness and size compared to datasets available within related fields. Recently, vision foundation models such as CLIP have gained significant attention, helping the vision community advance in finding robust, generalizable feature representations. With the majority of researchers within the wearable community relying on vision modalities to overcome the limited expressiveness of wearable data and accurately label their to-be-released benchmark datasets offline, we propose a novel, clustering-based annotation pipeline to significantly reduce the amount of data that needs to be annotated by a human annotator. We show that using our approach, the annotation of centroid clips suffices to achieve average labelling accuracies close to 90% across three publicly available HAR benchmark datasets. Using the weakly annotated datasets, we further demonstrate that we can match the accuracy scores of fully-supervised deep learning classifiers across all three benchmark datasets. Code as well as supplementary figures and results are publicly downloadable via github.com/mariusbock/weak_har.

Paper Structure (17 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Box plots showing the distribution of labelling accuracies across study participants as the number of clusters increases. The bar plot below the box plots shows, per cluster setting, the percentage of data an annotator would need to annotate relative to the total size of the three benchmark datasets (WEAR, Wetlab, ActionSense). With more clusters, labelling accuracy increases and the deviation across study participants decreases.
  • Figure 2: Confusion matrices comparing the fully-supervised shallow DeepConvLSTM results with those of the best-performing weak-labelling approach. With the exception of the NULL class, all activities were classified with performance close to that of the fully-supervised approach.