WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition

Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, Michael Moeller

TL;DR

This paper introduces WEAR, an outdoor sports dataset for both vision- and inertial-based human activity recognition (HAR), and demonstrates that single-stage Temporal Action Localization (TAL) models can be trained not only on visual data but also on raw inertial data, and can fuse both modalities by simple concatenation.

Abstract

Research has shown the complementarity of camera- and inertial-based data for modeling human activities, yet datasets with both egocentric video and inertial-based sensor data remain scarce. In this paper, we introduce WEAR, an outdoor sports dataset for both vision- and inertial-based human activity recognition (HAR). Data was collected from 22 participants performing a total of 18 different workout activities, with synchronized inertial (acceleration) and camera (egocentric video) data recorded at 11 different outside locations. WEAR provides a challenging prediction scenario in changing outdoor environments, using a sensor placement in line with recent trends in real-world applications. Benchmark results show that, with our sensor placement, the two modalities offer complementary strengths and weaknesses in their prediction performance. Further, in light of the recent success of single-stage Temporal Action Localization (TAL) models, we demonstrate their versatility: they can be trained not only on visual data but also on raw inertial data, and are capable of fusing both modalities by means of simple concatenation. The dataset and code to reproduce experiments are publicly available via: mariusbock.github.io/wear/.
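The fusion by simple concatenation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each modality has already been turned into one feature vector per time window and that both sequences are aligned window by window; the function name `fuse_by_concatenation` and the toy feature sizes are hypothetical and are not taken from the WEAR code base.

```python
from typing import List, Sequence


def fuse_by_concatenation(
    inertial: Sequence[Sequence[float]],
    camera: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate per-window inertial and camera feature vectors.

    Assumption: both inputs hold one feature vector per window and
    cover the same windows in the same order.
    """
    if len(inertial) != len(camera):
        raise ValueError("both modalities must cover the same number of windows")
    # Append the camera features to the inertial features of each window.
    return [list(i) + list(c) for i, c in zip(inertial, camera)]


# Toy example: 2 windows, 2 inertial features and 3 camera features each.
inertial_features = [[1.0, 2.0], [3.0, 4.0]]
camera_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
fused = fuse_by_concatenation(inertial_features, camera_features)
print(fused[0])  # [1.0, 2.0, 0.1, 0.2, 0.3]
```

The fused vectors have a dimensionality equal to the sum of the two inputs, so a downstream model only needs its input size adjusted to consume them.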

Paper Structure (21 sections, 7 figures, 2 tables)


Figures (7)

  • Figure 1: Snapshot along with descriptions of the annotation process using Final Cut Pro. Importing the converted video and inertial data (as '.wav'-files) allowed for an easy validation of the synchronization process. Labels were added via subtitles, exported as '.srt'-files and converted such that they can be appended to the respective '.csv'-files.
  • Figure 2: Visualization of the preprocessing applied to inertial and camera data in order to create a feature embedding which can be used to train the TriDet and ActionFormer networks.
  • Figure 3: Confusion matrices of the TriDet model [shiTriDetTemporalAction2023] applied to inertial data, vision (camera) data and both combined (inertial + camera), using a one-second sliding window with 50% overlap. Compared to inertial-based architectures [bockImprovingDeepLearning2021, abedinAttendDiscriminateStateoftheart2021], overall confusion (except for the NULL-class) is decreased. After combination, the strengths of each architecture are leveraged: e.g., jogging activities are no longer confused and overall confusion with the NULL-class decreases. Note that confusions which are 0 are omitted.
  • Figure 4: Color-coded comparison of the ground truth data of a sample participant with the best inertial-based (A-and-D), camera-based (TriDet) and fusion-based model (TriDet) along with an oracle combination of the best fusion-based model (O-LF(I, C)) as well as an oracle combination of the best camera, inertial and fusion-based model (O-LF(I, C, I + C)), using a sliding window approach of 1.0 seconds with a 50% overlap. The visualisation underlines the similarities amongst the predictive streams of O-LF(I, C) and the fusion approach, as well as the advantages of learning from both modalities simultaneously.
  • Figure 5: Average F1-score and mAP on the WEAR test set. The test set features 4 unseen participants as well as 2 recurring ones, an unseen location, different weather conditions and a new camera sensor. One can see that the observed trends are similar to those seen during LOSO cross-validation.
  • ...and 2 more figures
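
Several captions above refer to a one-second sliding window with 50% overlap. As a rough sketch of what such a segmentation looks like, the helper below computes window boundaries as sample indices. The function name `sliding_windows` and the 50 Hz sampling rate in the example are assumptions chosen for illustration; the captions themselves do not state a sampling rate.

```python
from typing import List, Tuple


def sliding_windows(
    n_samples: int,
    sampling_rate_hz: float,
    window_seconds: float = 1.0,
    overlap: float = 0.5,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows.

    `end` is exclusive. Trailing samples that do not fill a complete
    window are dropped.
    """
    window = int(round(window_seconds * sampling_rate_hz))
    step = int(round(window * (1.0 - overlap)))
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be at least one sample")
    return [(s, s + window) for s in range(0, n_samples - window + 1, step)]


# Example: 4 seconds of data at an assumed 50 Hz sampling rate.
windows = sliding_windows(n_samples=200, sampling_rate_hz=50)
print(len(windows))   # 7
print(windows[:2])    # [(0, 50), (25, 75)]
print(windows[-1])    # (150, 200)
```

With 50% overlap, each new window starts halfway through the previous one, so every sample (apart from those at the edges) contributes to two windows.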