A Multi-Modal CNN-LSTM Framework with Multi-Head Attention and Focal Loss for Real-Time Elderly Fall Detection

Lijie Zhou; Luran Wang

Abstract

The growing global elderly population has intensified the demand for reliable health monitoring systems, particularly those capable of detecting critical events such as falls among elderly individuals. Traditional fall detection approaches that rely on single-modality acceleration data suffer from high false alarm rates, while conventional machine learning methods require extensive hand-crafted feature engineering. This paper proposes a novel multi-modal deep learning framework, MultiModalFallDetector, designed for real-time elderly fall detection using wearable sensors. Our approach integrates six innovations: a multi-scale CNN-based feature extractor that captures motion dynamics at varying temporal resolutions; fusion of tri-axial accelerometer, gyroscope, and four-channel physiological signals; a multi-head self-attention mechanism for dynamic temporal weighting; Focal Loss to mitigate severe class imbalance; an auxiliary activity classification task for regularization; and transfer learning from UCI HAR to the SisFall dataset. Extensive experiments on the SisFall dataset, which includes simulated fall trials from elderly participants (aged 60-85), demonstrate that our framework achieves an F1-score of 98.7, Recall of 98.9, and AUC-ROC of 99.4, significantly outperforming baseline methods including traditional machine learning and standard deep learning approaches. The model maintains sub-50 ms inference latency on edge devices, confirming its suitability for real-time deployment in geriatric care settings.
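As a rough illustration of why Focal Loss helps with the class imbalance mentioned in the abstract, the sketch below implements the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), in plain Python. The function name `binary_focal_loss` and the defaults `alpha=0.25`, `gamma=2.0` are assumptions chosen for illustration (they are the commonly cited defaults for focal loss); the values actually used in the paper are not stated in this summary.

```python
import math


def binary_focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over a batch (illustrative sketch).

    probs  : predicted fall probabilities, each in [0, 1]
    labels : ground-truth labels, 1 = fall, 0 = non-fall
    alpha, gamma : assumed defaults, not values taken from the paper
    """
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equal in length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p              # probability given to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        # (1 - p_t)^gamma shrinks the loss of well-classified samples
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual binary cross-entropy; increasing `gamma` progressively down-weights easy examples, so the rare fall samples that the model still gets wrong dominate the gradient.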

Paper Structure (40 sections, 3 equations, 6 figures, 4 tables)

Introduction
Contributions
Methodology
Problem Formulation
Model Architecture Overview
Multi-Scale Convolutional Feature Extractor
Physiological Signal Processing Module
Bidirectional LSTM for Temporal Dynamics Modeling
Multi-Head Self-Attention Mechanism
Dual-Task Output Heads
Loss Function Design
Data Preparation
Dataset Selection and Characteristics
Preprocessing Pipeline
Training-Validation-Test Split Strategy
...and 25 more sections

Figures (6)

Figure 1: Overview of the proposed MultiModalFallDetector architecture. The five-stage pipeline consists of input preprocessing and segmentation, modality-specific feature extraction via multi-scale CNNs, temporal dynamics modeling via bidirectional LSTM, contextual attention weighting via multi-head self-attention, and dual-head prediction for fall detection and activity classification.
Figure 2: Visual comparison of data augmentation techniques applied to accelerometer signals. The original signal is shown in blue, while augmented versions demonstrate the effects of jitter (noise addition), scaling (amplitude variation), and rotation (orientation change).
Figure 3: ROC curve comparison of the proposed MultiModalFallDetector against baseline methods. The proposed model achieves an AUC-ROC of 0.97, demonstrating superior discriminative performance across all classification thresholds.
Figure 4: Confusion matrix of the proposed MultiModalFallDetector on the SisFall test set. Diagonal elements represent correct classifications, with the model achieving high accuracy across all activity classes while maintaining excellent fall detection performance.
Figure 5: Attention weight visualization for fall and walking sequences. In the fall sequence (top), attention weights concentrate sharply around the impact moment (timestep $\sim$50). In the walking sequence (bottom), attention is distributed more uniformly across the periodic gait pattern.
...and 1 more figure
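The three augmentations named in the Figure 2 caption (jitter, scaling, rotation) can be sketched for a tri-axial accelerometer window as follows. This is a minimal NumPy sketch under stated assumptions: the function names, the `(T, 3)` input layout, and the parameters `sigma` and `max_angle_deg` are hypothetical and are not taken from the paper, which may implement these transforms differently.

```python
import numpy as np


def jitter(signal, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to every sample (noise addition)."""
    rng = np.random.default_rng() if rng is None else rng
    return signal + rng.normal(0.0, sigma, size=signal.shape)


def scale(signal, sigma=0.1, rng=None):
    """Multiply the whole window by one random factor near 1 (amplitude variation)."""
    rng = np.random.default_rng() if rng is None else rng
    factor = rng.normal(1.0, sigma)
    return signal * factor


def rotate(signal, max_angle_deg=15.0, rng=None):
    """Rotate all (x, y, z) samples about a random axis (orientation change)."""
    rng = np.random.default_rng() if rng is None else rng
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)                      # unit rotation axis
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    # Skew-symmetric cross-product matrix of the axis
    k = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    # Rodrigues' rotation formula
    rot = np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)
    return signal @ rot.T                             # apply to each row vector
```

Each function expects an array of shape `(T, 3)` and returns an array of the same shape. Rotation leaves the per-sample acceleration magnitude unchanged, which is why it models a change in sensor orientation and not a change in motion intensity.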
