SDFA: Structure Aware Discriminative Feature Aggregation for Efficient Human Fall Detection in Video

Sania Zahan, Ghulam Mubashar Hassan, Ajmal Mian

TL;DR

SDFA addresses privacy-conscious fall detection by using 2D skeletons extracted from low-resolution video and a lightweight graph-based architecture. The method combines joint and motion streams in a shared space and uses a Spatial Graph Convolutional Network with a learnable adjacency matrix plus Separable Temporal Convolutions, augmented by randomized spatio-temporal masking and early fusion. Across five large-scale datasets, SDFA delivers competitive accuracy with substantially fewer FLOPs and parameters than prior methods, enabling real-time deployment on low-cost cameras without sacrificing privacy. The work demonstrates strong generalization and practical impact for smart healthcare monitoring in homes and care facilities. Key innovations include adaptive adjacency learning, efficient temporal modeling, and robust regularization to handle diverse real-world scenarios.
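
To make the parameter saving from separable convolutions concrete, the sketch below counts the weights of a standard 1D temporal convolution and of its depthwise separable counterpart. The channel widths and kernel size are hypothetical values chosen only for illustration; they are not taken from the paper, and biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel


def separable_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depthwise separable 1D convolution (bias ignored)."""
    depthwise = c_in * kernel   # one k-tap filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, assumed for illustration only.
C_IN, C_OUT, KERNEL = 64, 64, 9

standard = standard_conv_params(C_IN, C_OUT, KERNEL)
separable = separable_conv_params(C_IN, C_OUT, KERNEL)
print(f"standard: {standard}, separable: {separable}, "
      f"reduction: {standard / separable:.2f}x")
```

For these assumed sizes the separable layer needs roughly one eighth of the weights, because the cost ratio is 1/c_out + 1/kernel. The actual saving in SDFA depends on its real layer configuration.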

Abstract

Older people are susceptible to falls due to postural instability and deteriorating health. Immediate access to medical support can greatly reduce the repercussions. Hence, there is increasing interest in automated fall detection, often incorporated into a smart healthcare system to provide better monitoring. Existing systems focus on wearable devices, which are inconvenient, or on video monitoring, which raises privacy concerns. Moreover, these systems give a limited picture of their generalization ability, as they are tested on datasets containing few activities that differ widely in the action space and are easy to differentiate. Complex daily life scenarios pose much greater challenges, with activities that overlap in the action space due to similar posture or motion. To overcome these limitations, we propose a fall detection model, coined SDFA, based on human skeletons extracted from low-resolution videos. The use of skeleton data ensures privacy, and the use of low-resolution videos ensures low hardware and computational cost. Our model captures discriminative structural displacements and motion trends using unified joint and motion features projected onto a shared high-dimensional space. In particular, the use of separable convolution combined with a powerful GCN architecture provides improved performance. Extensive experiments on five large-scale datasets with a wide range of evaluation settings show that our model achieves competitive performance with extremely low computational complexity and runs faster than existing models.
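
The idea of unified joint and motion features can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the motion stream is the frame-to-frame displacement of each 2D joint, and that each stream is projected linearly and the two projections are summed. All function names, sizes, and the random weights are hypothetical.

```python
import random


def motion_stream(joints):
    """Frame-to-frame joint displacement; the first frame gets zero motion."""
    motion = [[(0.0, 0.0)] * len(joints[0])]
    for prev, curr in zip(joints, joints[1:]):
        motion.append([(cx - px, cy - py)
                       for (px, py), (cx, cy) in zip(prev, curr)])
    return motion


def linear(vec, weight, bias):
    """Project one 2D vector to len(bias) dimensions: y = W v + b."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def fuse(joints, w_joint, b_joint, w_motion, b_motion):
    """Add the projected joint and motion streams element-wise (early fusion)."""
    fused = []
    for frame_j, frame_m in zip(joints, motion_stream(joints)):
        fused.append([
            [pj + pm for pj, pm in zip(linear(j, w_joint, b_joint),
                                       linear(m, w_motion, b_motion))]
            for j, m in zip(frame_j, frame_m)
        ])
    return fused  # nested list: frames x joints x embedding dimension


rng = random.Random(0)
T, V, D = 4, 3, 8  # hypothetical: frames, joints, embedding size
joints = [[(rng.random(), rng.random()) for _ in range(V)] for _ in range(T)]
w_joint = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(D)]
w_motion = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(D)]
zero_bias = [0.0] * D
features = fuse(joints, w_joint, zero_bias, w_motion, zero_bias)
print(len(features), len(features[0]), len(features[0][0]))
```

Summing the two projections keeps a single feature tensor for the downstream layers, rather than running a separate network per stream. In a trained model the projection weights would be learned rather than sampled at random.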

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: SDFA architecture: First, skeleton joints are extracted from video streams using OpenPose (Section \ref{openpose}). Then, linear projections of the joint and motion streams are added to create a dynamic representation of the input skeletons (Section \ref{input_stream}). The compact feature vector is then processed through spatial graph convolution (Section \ref{sgcn_method}) and separable temporal convolution (Section \ref{sep_tcn_method}) to encode local and global context aggregation over neighbouring joints and frames. Finally, global average pooling is performed over the encoded feature vector for classification.
  • Figure 2: Skeleton extraction using OpenPose: (a) detected keypoints rendered over the original video frame; (b) extracted 2D joint coordinates.
  • Figure 3: Depthwise separable convolution: (a) the filtering operation is split into two steps, depthwise and pointwise convolution; (b) a basic depthwise separable convolution layer used in the proposed model (MobileNets2017).
  • Figure 4: Randomized masking. (a) Spatial joints (the three white ones) are masked. (b) Temporal frames (numbers 2, 5, and 6) are masked.
  • Figure 5: Samples of the Fall and Lying down activities from the UWA3D dataset. (a)-(b) are superimposed RGB frames and (c)-(d) are the corresponding skeleton joint frames. As the colour indicates, both activities have similar postures and differ only in their temporal occurrence. Similar skeleton poses for Lying down are spread further over time, indicating a longer duration and a moderate transition speed (a deeper colour indicates a higher frame number, and therefore a later time of occurrence).
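
The randomized masking shown in Figure 4 can be sketched as follows. This is a hedged illustration rather than the paper's procedure: it assumes masked entries are set to zero and that spatial and temporal masks are drawn uniformly at random and applied together. The function name `random_mask`, the sequence size, and the number of masked joints and frames are all hypothetical.

```python
import random


def random_mask(skeleton, n_joints, n_frames, rng):
    """Return a masked copy plus the chosen joint and frame indices.

    skeleton is a list of frames; each frame is a list of (x, y) joints.
    Masked joints are zeroed in every frame; masked frames are zeroed fully.
    """
    num_frames, num_joints = len(skeleton), len(skeleton[0])
    masked_joints = set(rng.sample(range(num_joints), n_joints))
    masked_frames = set(rng.sample(range(num_frames), n_frames))
    masked = []
    for t, frame in enumerate(skeleton):
        masked.append([
            (0.0, 0.0) if (t in masked_frames or v in masked_joints) else xy
            for v, xy in enumerate(frame)
        ])
    return masked, masked_joints, masked_frames


rng = random.Random(0)
# Toy sequence: 8 frames of 10 joints, with coordinates that are never (0, 0).
skeleton = [[(float(t), float(v) + 1.0) for v in range(10)] for t in range(8)]
masked, joints_hit, frames_hit = random_mask(skeleton, 3, 3, rng)
print("masked joints:", sorted(joints_hit))
print("masked frames:", sorted(frames_hit))
```

Drawing a fresh mask for every training sample acts as a regularizer: the model cannot rely on any single joint or frame being present, which is useful when pose estimation drops keypoints in low-resolution video.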