A Matter of Annotation: An Empirical Study on In Situ and Self-Recall Activity Annotations from Wearable Sensors
Alexander Hoelzemann, Kristof Van Laerhoven
TL;DR
This study examines how four annotation methods for wearable-sensor HAR data influence label quality and downstream classifier performance in real-world settings. Using 11 participants over two weeks and a DeepConvLSTM classifier, the authors show that in situ labeling yields precise segments but suffers from missing annotations, while diary-based approaches struggle with precision. Importantly, adding a time-series visualization tool (MAD-GUI) to a diary-based workflow reduced missing labels and increased annotation consistency, which improved F1 scores by up to 8% (to as high as 90.4%), underscoring the value of user-friendly annotation aids. The findings highlight annotation biases and offer actionable guidance for designing labeling workflows in long-term, in-the-wild HAR studies, with implications for privacy-preserving, high-quality ground-truth generation and model training.
Abstract
Research into the detection of human activities from wearable sensors is a highly active field, benefiting numerous applications, from ambulatory monitoring of healthcare patients and fitness coaching to streamlining manual work processes. We present an empirical study that evaluates and contrasts four commonly employed annotation methods in user studies focused on in-the-wild data collection. For both the user-driven, in situ annotations, where participants annotate their activities during the actual recording process, and the recall methods, where participants retrospectively annotate their data at the end of each day, the participants had the flexibility to select their own set of activity classes and corresponding labels. Our study illustrates that different labeling methodologies directly impact the annotations' quality, as well as the capabilities of a deep learning classifier trained with the data. We noticed that in situ methods produce fewer but more precise labels than recall methods. Furthermore, we combined an activity diary with a visualization tool that enables the participant to inspect and label their activity data. By introducing such a tool, we were able to decrease missing annotations and increase annotation consistency, and thereby the F1-Score of the deep learning model by up to 8% (ranging between 82.1 and 90.4% F1-Score). Finally, we discuss the advantages and disadvantages of the methods compared in our study, the biases they could introduce, and the consequences of their usage for human activity recognition studies, as well as possible solutions.
