DeepConvContext: A Multi-Scale Approach to Timeseries Classification in Human Activity Recognition

Marius Bock, Michael Moeller, Kristof Van Laerhoven

TL;DR

Human Activity Recognition (HAR) traditionally relies on sliding windows, which restrict the temporal context a model can learn to a single window. DeepConvContext introduces a multi-scale framework that combines intra-window CNN+LSTM feature extraction with an inter-window LSTM to capture both local and long-range temporal dependencies, with bidirectional inter-window modeling further boosting performance. Across six HAR benchmarks, it achieves about a 10% average F1-score gain over DeepConvLSTM (up to 21%), and ablations show LSTMs consistently outperform attention- and Transformer-based alternatives for inertial data. This work advances practical HAR by enabling more coherent inter-window reasoning while maintaining competitive computational profiles, and code is publicly available for reproducibility.

Abstract

Despite recognized limitations in modeling long-range temporal dependencies, Human Activity Recognition (HAR) has traditionally relied on a sliding window approach to segment labeled datasets. Deep learning models like the DeepConvLSTM typically classify each window independently, thereby restricting learnable temporal context to within-window information. To address this constraint, we propose DeepConvContext, a multi-scale time series classification framework for HAR. Drawing inspiration from the vision-based Temporal Action Localization community, DeepConvContext models both intra- and inter-window temporal patterns by processing sequences of time-ordered windows. Unlike recent HAR models that incorporate attention mechanisms, DeepConvContext relies solely on LSTMs -- with ablation studies demonstrating the superior performance of LSTMs over attention-based variants for modeling inertial sensor data. Across six widely-used HAR benchmarks, DeepConvContext achieves an average 10% improvement in F1-score over the classic DeepConvLSTM, with gains of up to 21%. Code to reproduce our experiments is publicly available via github.com/mariusbock/context_har.
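The step from independent windows to sequences of time-ordered windows can be illustrated with a minimal NumPy sketch. It is not taken from the authors' repository: the function names, the 50 Hz sampling rate, the window size, stride and sequence length are all assumptions chosen for illustration.

```python
import numpy as np

def sliding_windows(signal, window_size, stride):
    """Segment a (T, C) recording into windows of shape (N, window_size, C)."""
    starts = range(0, signal.shape[0] - window_size + 1, stride)
    return np.stack([signal[s:s + window_size] for s in starts])

def group_into_sequences(windows, seq_len):
    """Group consecutive windows into non-overlapping, time-ordered sequences.

    Trailing windows that do not fill a complete sequence are dropped.
    """
    n_seq = windows.shape[0] // seq_len
    return windows[:n_seq * seq_len].reshape(n_seq, seq_len, *windows.shape[1:])

# Hypothetical recording: 10 s of 3-axis accelerometer data at 50 Hz.
signal = np.random.default_rng(0).normal(size=(500, 3))

# Classic HAR: each 1 s window (50% overlap) would be classified on its own.
windows = sliding_windows(signal, window_size=50, stride=25)

# Multi-scale input: the model sees several consecutive windows at once.
sequences = group_into_sequences(windows, seq_len=4)

print(windows.shape)    # (19, 50, 3)
print(sequences.shape)  # (4, 4, 50, 3)
```

A window-wise classifier consumes `windows` one row at a time, whereas a model that learns inter-window patterns consumes one row of `sequences` and can return one label per window in that row.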

Paper Structure

This paper contains 14 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the proposed DeepConvContext architecture. The architecture follows a multi-scale approach in which an input timeseries is segmented into equal-sized patches. Each patch is processed individually by a DeepConvLSTM-like feature extractor, i.e. a combination of multiple convolutional layers and an LSTM. The resulting intra-patch temporal context vectors are reduced to a 1-dimensional feature vector per patch. The sequence of feature vectors of all patches is then passed to a second LSTM to learn inter-patch temporal features. The resulting patch-wise feature vectors are then classified, and a sequence of patch-wise activity labels is returned.
  • Figure 2: Average F1-score and mAP of the DeepConvLSTM [ordonezDeepConvolutionalLSTM2016], the Shallow DeepConvLSTM [bockImprovingDeepLearning2021] and the proposed DeepConvContext on the WEAR [bockWEAROutdoorSports2024], Wetlab [schollWearablesWetLab2015], Hang-Time [hoelzemannHangTimeHARBenchmark2023], RWHAR [sztylerOnBodyLocalizationWearable2016], Opportunity [roggenCollectingComplexActivity2010] and SBHAR [reyes-ortizTransitionAwareHumanActivity2016] datasets. DeepConvContext is additionally evaluated with a bidirectional LSTM for inter-window learning. Results are class- and participant-averaged scores, averaged across three runs with different random seeds. DeepConvContext combines the strengths of both architectures and improves results on all datasets, with the bidirectional version achieving the highest F1-score and mAP.
  • Figure 3: Per-class confusion matrices of the DeepConvLSTM, Shallow DeepConvLSTM and DeepConvContext applied to the SBHAR dataset using leave-one-subject-out (LOSO) cross-validation. DeepConvContext is further applied with a bidirectional LSTM, as described in the paper's subsection on variants. DeepConvContext improves upon both variants of the DeepConvLSTM, with the bidirectional variant producing the best overall predictions. Transition classes such as sit-to-stand, in particular, are detected more reliably.
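
The pipeline described in the Figure 1 caption can be sketched in PyTorch as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the class name, layer counts, filter and hidden sizes are hypothetical, and taking the last time step of the intra-patch LSTM is only one possible way to reduce each patch to a single feature vector.

```python
import torch
import torch.nn as nn

class DeepConvContextSketch(nn.Module):
    """Illustrative two-level model: per-patch CNN+LSTM, then an LSTM across patches."""

    def __init__(self, n_channels, n_classes, n_filters=64, kernel_size=9,
                 hidden=128, bidirectional=False):
        super().__init__()
        # Intra-patch feature extraction: convolutions along the time axis only.
        self.convs = nn.Sequential(
            nn.Conv2d(1, n_filters, (kernel_size, 1)), nn.ReLU(),
            nn.Conv2d(n_filters, n_filters, (kernel_size, 1)), nn.ReLU(),
        )
        self.intra_lstm = nn.LSTM(n_filters * n_channels, hidden, batch_first=True)
        # Inter-patch LSTM over the sequence of patch feature vectors.
        self.inter_lstm = nn.LSTM(hidden, hidden, batch_first=True,
                                  bidirectional=bidirectional)
        self.classifier = nn.Linear(hidden * (2 if bidirectional else 1), n_classes)

    def forward(self, x):
        # x: (batch, n_patches, patch_size, n_channels)
        b, n, t, c = x.shape
        h = self.convs(x.reshape(b * n, 1, t, c))   # (b*n, filters, t', c)
        h = h.permute(0, 2, 1, 3).flatten(2)        # (b*n, t', filters*c)
        h, _ = self.intra_lstm(h)                   # (b*n, t', hidden)
        # Assumption: keep the last time step as the patch's feature vector.
        h = h[:, -1].reshape(b, n, -1)              # (b, n, hidden)
        h, _ = self.inter_lstm(h)                   # (b, n, hidden or 2*hidden)
        return self.classifier(h)                   # (b, n, n_classes)

# Hypothetical input: 2 sequences of 4 patches, 50 samples each, 3 sensor axes.
x = torch.randn(2, 4, 50, 3)
model = DeepConvContextSketch(n_channels=3, n_classes=6, bidirectional=True)
logits = model(x)
print(logits.shape)  # torch.Size([2, 4, 6])
```

The output holds one score vector per patch, matching the caption's sequence of patch-wise activity labels; setting `bidirectional=True` corresponds to the bidirectional inter-window variant mentioned in Figures 2 and 3.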