Action Recognition in Real-World Ambient Assisted Living Environment

Vincent Gbouna Zakka, Zhuangzhuang Dai, Luis J. Manso

TL;DR

This work tackles action recognition in real-world ambient assisted living by addressing occlusion, noise, and limited computational resources in skeleton-based human action recognition (HAR). It introduces RE-TCN, a robust and efficient temporal convolution network that uses Adaptive Temporal Weighting to emphasize informative frames, Depthwise Separable Convolutions to reduce parameters, and data augmentation to improve resilience. Across four benchmarks, RE-TCN achieves state-of-the-art accuracy and strong robustness to occlusion and noise, and it runs in real time on CPU devices, which supports privacy-preserving in-home monitoring. The approach has practical potential for reliable, privacy-conscious monitoring in care settings; future work focuses on more diverse data from care homes and on field deployments to validate real-world utility.

Abstract

The growing ageing population, and older adults' preference to maintain independence by living in their own homes, calls for proactive strategies to ensure safety and support. Ambient Assisted Living (AAL) technologies have emerged to facilitate ageing in place by offering continuous monitoring and assistance within the home. Within AAL technologies, action recognition plays a crucial role in interpreting human activities and detecting incidents such as falls, mobility decline, or unusual behaviours that may signal worsening health conditions. However, action recognition in practical AAL applications presents challenges, including occlusions, noisy data, and the need for real-time performance. While advances have been made in accuracy, robustness to noise, and computational efficiency, balancing all three remains a challenge. To address this challenge, this paper introduces the Robust and Efficient Temporal Convolution Network (RE-TCN), which comprises three main elements: Adaptive Temporal Weighting (ATW), Depthwise Separable Convolutions (DSC), and data augmentation techniques. These elements aim to enhance the model's accuracy, robustness against noise and occlusion, and computational efficiency within real-world AAL contexts. Validated on four benchmark datasets (NTU RGB+D 60, Northwestern-UCLA, SHREC'17, and DHG-14/28), RE-TCN outperforms existing models in accuracy and in robustness to noise and occlusion. The code is publicly available at: https://github.com/Gbouna/RE-TCN
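
To illustrate why Depthwise Separable Convolutions reduce parameters, the following is a minimal sketch that counts the weights of a standard 1D temporal convolution and of its depthwise separable counterpart (a per-channel depthwise filter followed by a 1x1 pointwise convolution). The channel counts and kernel size are hypothetical values chosen for illustration; they are not taken from the paper or from the RE-TCN code.

```python
def standard_conv_params(c_in, c_out, kernel_size, bias=True):
    """Parameters of a standard 1D convolution over the temporal axis."""
    weights = c_in * c_out * kernel_size
    return weights + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, kernel_size, bias=True):
    """Parameters of a depthwise conv followed by a 1x1 pointwise conv."""
    # Depthwise: one temporal filter per input channel.
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    # Pointwise: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, for illustration only.
    c_in, c_out, k = 64, 64, 9
    std = standard_conv_params(c_in, c_out, k)
    dsc = depthwise_separable_params(c_in, c_out, k)
    print(f"standard: {std}, depthwise separable: {dsc}, ratio: {std / dsc:.2f}")
```

Under these assumed sizes the separable layer needs 4,800 parameters against 36,928 for the standard layer, roughly a 7.7x reduction; the saving grows with the kernel size and the number of output channels.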

Paper Structure

This paper contains 36 sections, 21 equations, 5 figures, 17 tables, 3 algorithms.

Figures (5)

  • Figure 1: Challenges in a real-world environment: A) represents data with noise and occlusion, B) represents relatively clean data, and C) represents data with occlusion
  • Figure 2: Architecture of the proposed RE-TCN: Graph convolution is first applied to the skeleton sequences. The output is then passed to the multi-branch temporal convolution, followed by the ATW mechanism, and finally to the classification module for action recognition. DSC and ATW denote Depthwise Separable Convolution and Adaptive Temporal Weighting, respectively.
  • Figure 3: A skeleton sample of "Cross Arm" action with data augmentation techniques: jittering, random occlusion, frame occlusion, and rotation
  • Figure 4: Confusion matrices showing classification performance for each class in the NTU RGB+D 60, Northwestern-UCLA, SHREC'17, and DHG-14/28 datasets
  • Figure 5: Real-time action recognition: A) action recognition without occlusions, B) action recognition with occlusions. Predicted action: the action recognised by the model. Inference time: the time taken for the model to generate a prediction. Inference+Processing time: the total time spanning frame capture, pose extraction, and model prediction
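
The four augmentation techniques named in the Figure 3 caption (jittering, random occlusion, frame occlusion, and rotation) can be sketched with NumPy as below. This is an illustrative sketch, not the authors' implementation. It assumes a skeleton sequence stored as an array of shape (frames, joints, 3), that "random occlusion" hides a random subset of joints, that "frame occlusion" hides a random subset of whole frames, and that hidden values are set to zero. All function names, parameters, and default values are hypothetical.

```python
import numpy as np


def jitter(seq, rng, sigma=0.01):
    """Add zero-mean Gaussian noise to every joint coordinate."""
    return seq + rng.normal(0.0, sigma, size=seq.shape)


def random_joint_occlusion(seq, rng, num_joints=3):
    """Zero out a random subset of joints in all frames (assumed behaviour)."""
    out = seq.copy()
    joints = rng.choice(seq.shape[1], size=num_joints, replace=False)
    out[:, joints, :] = 0.0
    return out


def frame_occlusion(seq, rng, num_frames=5):
    """Zero out a random subset of whole frames (assumed behaviour)."""
    out = seq.copy()
    frames = rng.choice(seq.shape[0], size=num_frames, replace=False)
    out[frames] = 0.0
    return out


def rotate_about_y(seq, angle_rad):
    """Rotate every joint about the vertical (y) axis by angle_rad."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    # Each joint is a row vector p, so p @ rot.T equals (rot @ p) transposed.
    return seq @ rot.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skeleton = rng.normal(size=(20, 25, 3))  # 20 frames, 25 joints, xyz
    augmented = rotate_about_y(
        frame_occlusion(random_joint_occlusion(jitter(skeleton, rng), rng), rng),
        np.pi / 6,
    )
    print(augmented.shape)
```

Each function returns a new array and leaves its input unchanged, so the transforms can be composed in any order or applied with some probability during training.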