Temporal Action Localization for Inertial-based Human Activity Recognition

Marius Bock; Michael Moeller; Kristof Van Laerhoven

TL;DR

This paper is the first to systematically demonstrate the applicability of state-of-the-art TAL models for both offline and near-online Human Activity Recognition (HAR) using raw inertial data as well as pre-extracted latent features as input.

Abstract

As of today, state-of-the-art activity recognition from wearable sensors relies on algorithms being trained to classify fixed windows of data. In contrast, video-based Human Activity Recognition, known as Temporal Action Localization (TAL), has followed a segment-based prediction approach, localizing activity segments in a timeline of arbitrary length. This paper is the first to systematically demonstrate the applicability of state-of-the-art TAL models for both offline and near-online Human Activity Recognition (HAR) using raw inertial data as well as pre-extracted latent features as input. Offline prediction results show that TAL models are able to outperform popular inertial models on a multitude of HAR benchmark datasets, with improvements reaching as much as 26% in F1-score. We show that by analyzing timelines as a whole, TAL models can produce more coherent segments and achieve higher NULL-class accuracy across all datasets. We demonstrate that TAL is less suited for the immediate classification of small-sized windows of data, yet offers an interesting perspective on inertial-based HAR, alleviating the need for fixed-size windows and enabling algorithms to recognize activities of arbitrary length. With design choices and training concepts yet to be explored, we argue that TAL architectures could be of significant value to the inertial-based HAR community. The code and data needed to reproduce the experiments are publicly available at github.com/mariusbock/tal_for_har.

Paper Structure (19 sections, 3 equations, 7 figures, 4 tables)

Introduction
Related Work
Inertial-based Human Activity Recognition
Video-based Human Activity Recognition
Temporal Action Localization for Inertial-based HAR
Vectorization of raw inertial data
Two-stage training via prepended inertial models
TAL Architectures Overview
Methodology
Datasets
Training Pipeline
Prediction Scenarios
Hyperparameters
Postprocessing
Results
...and 4 more sections

Figures (7)

Figure 1: Overview of the prediction pipelines applied in inertial-based activity recognition and single-stage Temporal Action Localization (TAL). Both apply a sliding window to divide input data into windows of a certain duration (e.g. one second). TAL models do not use raw data as input but are applied on per-clip, pre-extracted feature embeddings. Inertial activity recognition models predict activity labels for each sliding window, which are used for calculating classification metrics such as accuracy and F1. TAL models predict activity segments, defined by a label, start and end points, and are evaluated with mean Average Precision (mAP) applied at different temporal Intersection over Union (tIoU) thresholds.
Figure 2: Visualization of the vectorization applied to windowed inertial data, assuming four 3D inertial sensors and a sliding window size of 50 samples. Each 2D sliding window of size $[50 \times 12]$ is vectorized by concatenating its axes one after another. The resulting 1D embedding vectors, of size $[1 \times 600]$, can be used to train TAL models.
Figure 3: Visualization of the applied two-stage training process. The first stage involves training an inertial model, e.g. the classic DeepConvLSTM introduced by Ordóñez and Roggen (2016). Once first-stage training has finished, the classifier is removed from the model so that latent features can be extracted. The 1-dimensional, window-wise features are then used as input embeddings for the second stage, i.e. training a TAL model.
Figure 4: Architecture overview of the ActionFormer proposed by Zhang et al. (2022). The architecture follows an encoder-decoder structure. The encoder encodes input sequences into a feature pyramid, which captures information at various temporal scales. The decoder, consisting of a classification and a regression head, then decodes each timestamp within the feature pyramid into sequence labels, i.e. a class probability vector and the timestamp's distances to the activity onset and offset. TriDet (Shi et al., 2023) and TemporalMaxer (Tang et al., 2023) both follow the same encoder-decoder structure, yet suggest architectural changes.
Figure 5: Offline Activity Recognition: Confusion matrices of (a) the best TAL architecture (TriDet; Shi et al., 2023) and (b) the inertial model (shallow DeepConvLSTM), applied to the SBHAR (Reyes-Ortiz et al., 2016) (top) and RWHAR (Sztyler and Stuckenschmidt, 2016) (bottom) datasets with a one-second sliding window and 50% overlap. Confusions that are 0 are omitted.
...and 2 more figures
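
Two operations mentioned in the captions above lend themselves to a short illustration: the window vectorization of Figure 2 and the temporal Intersection over Union (tIoU) used to evaluate predicted segments in Figure 1. The following is a minimal sketch and not the authors' implementation. The function names `vectorize_window` and `temporal_iou` are hypothetical, and the axis-by-axis ordering is one assumed reading of "concatenating each of the axes one after another".

```python
import numpy as np


def vectorize_window(window: np.ndarray) -> np.ndarray:
    """Flatten a [samples x axes] window into a [1 x samples*axes] vector.

    Assumption: axes are concatenated one after another, i.e. all samples
    of axis 0 come first, then all samples of axis 1, and so on.
    """
    # Transposing first makes each sensor axis a contiguous run in the output.
    return window.T.reshape(1, -1)


def temporal_iou(pred: tuple, truth: tuple) -> float:
    """Temporal IoU of two segments given as (start, end) pairs."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    # Two zero-length segments have no meaningful overlap.
    return intersection / union if union > 0 else 0.0


# Four 3D sensors (12 axes) and a window of 50 samples, as in Figure 2.
window = np.arange(50 * 12).reshape(50, 12)
embedding = vectorize_window(window)

# A predicted segment counts as a match at a given tIoU threshold
# only if its overlap with the ground truth reaches that threshold.
overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

Here `embedding` has shape `(1, 600)`, matching the embedding size stated in Figure 2. A prediction such as the one above would then be compared against each tIoU threshold when computing mAP.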
