A Matter of Annotation: An Empirical Study on In Situ and Self-Recall Activity Annotations from Wearable Sensors

Alexander Hoelzemann, Kristof Van Laerhoven

TL;DR

This study interrogates how four annotation methods for wearable-sensor HAR data influence label quality and downstream classifier performance in real-world settings. Using 11 participants over two weeks and a DeepConvLSTM classifier, the authors show that in-situ labeling yields precise segments but suffers from missing annotations, while diary-based approaches struggle with precision. Importantly, adding a time-series visualization tool (MAD-GUI) to a diary-based workflow reduced missing labels and, when combined with a data-grounded labeling process, improved F1 scores by up to 8% (to as high as 90.4%), underscoring the value of user-friendly annotation aids. The findings highlight annotation biases and offer actionable guidance for designing labeling workflows in long-term, in-the-wild HAR studies, with implications for privacy-preserving, high-quality ground-truth generation and model training.

Abstract

Research into the detection of human activities from wearable sensors is a highly active field, benefiting numerous applications, from ambulatory monitoring of healthcare patients via fitness coaching to streamlining manual work processes. We present an empirical study that evaluates and contrasts four commonly employed annotation methods in user studies focused on in-the-wild data collection. For both the user-driven, in situ annotations, where participants annotate their activities during the actual recording process, and the recall methods, where participants retrospectively annotate their data at the end of each day, the participants had the flexibility to select their own set of activity classes and corresponding labels. Our study illustrates that different labeling methodologies directly impact the annotations' quality, as well as the capabilities of a deep learning classifier trained with the data. We noticed that in situ methods produce fewer but more precise labels than recall methods. Furthermore, we combined an activity diary with a visualization tool that enables the participant to inspect and label their activity data. By introducing such a tool, we were able to decrease missing annotations and increase the annotation consistency, and therefore the F1-Score of the deep learning model, by up to 8% (ranging between 82.1 and 90.4% F1-Score). Finally, we discuss the advantages and disadvantages of the methods compared in our study, the biases they could introduce, the consequences of their usage on human activity recognition studies, and possible solutions.

Paper Structure (15 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 1: The study participants collected data for 14 days in total and annotated the data with 4 different methods: labeling in situ with a mechanical button, labeling in situ with an app, writing a pure self-recall diary, and writing a self-recall diary assisted by a visualization of their time-series data.
  • Figure 2: The architecture consists of an Input Layer with the kernel-size 10 (window_size) x 10 (filter_length) x 3 (channels). The data is passed into 3 concatenated convolutional blocks, followed by a MaxPooling layer (kernel 2x1) where 50% of the data is filtered. Each convolutional block consists of a convolutional layer with a variable kernel size of 5x1x(n*64), followed by a ReLU activation function and a BatchNorm-Layer. We decided to use a single LSTM-Layer with the size of 512 units, as mentioned by bock2021improving, which is followed by a Dropout-Layer that filters 30% of randomly selected samples of the window.
  • Figure 3: Leave-One-Day-Out Cross Validation. The models are personally trained for every participant and are not intended to generalize across all study participants. Instead, a generalization across all days of one week is desired.
  • Figure 4: This figure illustrates the relative prevalence of various activity classes within the dataset, excluding the void class; see the class distribution table for details. The class labeled as void represents the predominant category within the dataset, surpassing the frequency of the second most prevalent class, desk_work, by a substantial factor of 13. The figure illustrates a pronounced imbalance in the data distribution, both in terms of the annotation methodology employed and the distinct week-specific patterns observed in the annotation process.
  • Figure 5: Missing annotations across all study participants and both weeks. The Y-axis shows the total number of annotations of one specific participant for the corresponding week. The color codes are as follows: Annotation is missing, Annotation is partially missing (start or stop time), Annotation is complete. The figure is inspired by brenner1999errors, Figure 1.
  • ...and 3 more figures
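
The Leave-One-Day-Out scheme named in the Figure 3 caption can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `leave_one_day_out`, the `day_1` ... `day_7` identifiers, and the assumption of seven recorded days per week are hypothetical. It only shows how one participant's days could be split so that each day is held out once, while the personal model is trained on the remaining days of the same week.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_day_out(days: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield one (training_days, held_out_day) fold per recorded day."""
    for held_out in days:
        training_days = [day for day in days if day != held_out]
        yield training_days, held_out


# Hypothetical layout: one participant, one week of seven recorded days.
week_days = [f"day_{i}" for i in range(1, 8)]
folds = list(leave_one_day_out(week_days))

for training_days, held_out in folds:
    # A personal model would be trained on training_days and scored on held_out.
    print(f"test on {held_out}, train on {len(training_days)} days")
```

Because the models are trained per participant, a split like this would be repeated for every participant and week, and the per-fold scores averaged afterwards.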