Fusing Biomechanical and Spatio-Temporal Features for Fall Prediction: Characterizing and Mitigating the Simulation-to-Reality Gap

Md Fokhrul Islam, Sajeda Al-Hammouri, Christopher J. Arellano, Kavan Hazeli, Heman Shakeri

TL;DR

This work tackles imminent fall prediction from vision data, addressing the scarcity of real fall data and a persistent simulation–reality gap. It introduces BioST-GCN, a dual-stream architecture that fuses pose-based spatio-temporal features (from an ST-GCN with Body Attention) with engineered biomechanical features processed by a BiLSTM; the two streams are connected through a cross-attention AttFusion module. BioST-GCN achieves strong intra-subject performance (F1 ~89.1%, AUPRC ~91.1%) and improves over the vanilla ST-GCN baseline. Zero-shot transfer to unseen subjects, however, drops F1 to ~35.9%, while few-shot personalization yields rapid performance gains, underscoring the need for model personalization and richer, bias-aware data. The study highlights the simulation–reality gap for fall prediction in elderly populations and calls for privacy-preserving data pipelines and domain adaptation strategies to translate these advances into clinically reliable tools.

Abstract

Falls are a leading cause of injury and loss of independence among older adults. Vision-based fall prediction systems offer a non-invasive solution to anticipate falls seconds before impact, but their development is hindered by the scarcity of available fall data. Contributing to these efforts, this study proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines both pose and biomechanical information using a cross-attention fusion mechanism. Our model outperforms the vanilla ST-GCN baseline by 5.32% and 2.91% F1-score on the simulated MCF-UA stunt-actor and MUVIM datasets, respectively. The spatio-temporal attention mechanisms in the ST-GCN stream also provide interpretability by identifying critical joints and temporal phases. However, a critical simulation-reality gap persists. While our model achieves an 89.0% F1-score with full supervision on simulated data, zero-shot generalization to unseen subjects drops to 35.9%. This performance decline is likely due to biases in simulated data, such as 'intent-to-fall' cues. For older adults, particularly those with diabetes or frailty, this gap is exacerbated by their unique kinematic profiles. To address this, we propose personalization strategies and advocate for privacy-preserving data pipelines to enable real-world validation. Our findings underscore the urgent need to bridge the gap between simulated and real-world data to develop effective fall prediction systems for vulnerable elderly populations.
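The cross-attention fusion of the two streams can be illustrated with a minimal NumPy sketch. It is not the authors' AttFusion module: the feature dimensions, the choice of the pose stream as the query, the random projection matrices `w_q`, `w_k`, `w_v`, and the final concatenation are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fusion(pose_feats, bio_feats, w_q, w_k, w_v):
    """Fuse two per-frame feature streams with scaled dot-product cross-attention.

    pose_feats: (T, d_pose) features from the pose stream (assumed query side).
    bio_feats:  (T, d_bio) features from the biomechanical stream (assumed key/value side).
    Returns the fused features (T, 2 * d) and the attention weights (T, T).
    """
    q = pose_feats @ w_q                      # (T, d) queries from the pose stream
    k = bio_feats @ w_k                       # (T, d) keys from the biomechanical stream
    v = bio_feats @ w_v                       # (T, d) values from the biomechanical stream
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T) scaled similarity
    weights = softmax(scores, axis=-1)        # each row sums to 1
    attended = weights @ v                    # (T, d) biomechanical context per frame
    fused = np.concatenate([q, attended], axis=-1)  # hypothetical fusion by concatenation
    return fused, weights


# Hypothetical sizes: 8 frames, 16-dim pose features, 6 biomechanical features.
rng = np.random.default_rng(0)
T, d_pose, d_bio, d = 8, 16, 6, 4
pose_feats = rng.normal(size=(T, d_pose))
bio_feats = rng.normal(size=(T, d_bio))
w_q = rng.normal(size=(d_pose, d))
w_k = rng.normal(size=(d_bio, d))
w_v = rng.normal(size=(d_bio, d))

fused, weights = cross_attention_fusion(pose_feats, bio_feats, w_q, w_k, w_v)
```

In a trained model the projections would be learned, and the fused features would feed the fully connected layers that output the fall probability.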

Paper Structure

This paper contains 25 sections, 10 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Fall Prediction Model Pipeline. The system extracts 3D pose landmarks from video, segments them, and feeds them into a dual-stream network. One stream uses ST-GCN for pose dynamics; the other uses a BiLSTM for engineered biomechanical features. An attention mechanism fuses features from both streams, followed by fully connected layers for fall probability prediction.
  • Figure 2: Performance comparison across prediction horizons (0-4 seconds before fall). (a) F1-Score and (b) AUPRC for BioST-GCN and LSTM baseline in same-subject (split 1) and cross-subject (split 2) settings. Error bars represent standard deviation over 5 independent runs. BioST-GCN demonstrates superior performance in same-subject settings and robust generalization in cross-subject evaluation, with consistently lower variance than LSTM.
  • Figure 3: Attention mechanisms in fall prediction. (a) Attention values (heat map) for different joints and scenarios (e.g., fall vs. non-fall), where color intensity represents attention strength. (b) Attention distribution across body joints in sequential ST-GCN blocks. The upper row depicts attention in an initial ST-GCN block, while the lower row shows attention in a deeper ST-GCN block. Red circles (attention values $> 0.3$) and blue circles (attention values $< 0.3$) indicate the magnitude of attention received by individual joints, with circle size corresponding to attention magnitude.
  • Figure 4: (a) Performance comparison with varying window sizes. (b) F1-Score for different numbers of ST-GCN blocks and LSTM layers.
  • Figure 5: Illustration of the anatomical reference frame and body landmarks. The principal axes (frontal, sagittal, longitudinal) are shown on the left. The frontal plane is defined by chest and pelvis points.
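As a rough illustration of the anatomical reference frame in Figure 5, the sketch below builds three orthonormal axes from chest and pelvis landmarks. The landmark choice (shoulders and hips), the function name `anatomical_frame`, and the sign conventions are assumptions rather than the paper's definition.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length; reject degenerate (zero) vectors."""
    n = np.linalg.norm(v)
    if n == 0:
        raise ValueError("cannot normalize a zero-length vector")
    return v / n


def anatomical_frame(l_shoulder, r_shoulder, l_hip, r_hip):
    """Build (frontal, sagittal, longitudinal) unit axes from four 3D landmarks.

    Assumption: chest and pelvis points are the shoulder and hip midpoints.
    """
    chest = (l_shoulder + r_shoulder) / 2.0
    pelvis = (l_hip + r_hip) / 2.0
    longitudinal = unit(chest - pelvis)       # pelvis-to-chest direction
    lateral = l_shoulder - r_shoulder         # right-to-left direction
    # Gram-Schmidt step: make the frontal axis orthogonal to the longitudinal axis.
    frontal = unit(lateral - np.dot(lateral, longitudinal) * longitudinal)
    # The sagittal axis is the normal of the frontal plane spanned by the other two.
    sagittal = np.cross(frontal, longitudinal)
    return frontal, sagittal, longitudinal


# Hypothetical upright pose: x points to the subject's left, y points up.
frontal, sagittal, longitudinal = anatomical_frame(
    np.array([0.2, 1.5, 0.0]),
    np.array([-0.2, 1.5, 0.0]),
    np.array([0.15, 1.0, 0.0]),
    np.array([-0.15, 1.0, 0.0]),
)
```

Expressing joint positions in such a body-centered frame is one way to make engineered features independent of camera orientation; whether the sagittal axis points forward or backward depends on the coordinate convention of the pose estimator.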