A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition
Wasi Ullah, Yasir Noman Khalid, Saddam Hussain Khan
TL;DR
This work tackles the scalability of real-time HAR by combining spatial and temporal learning with efficient feature pruning. A customized Inception-V3 captures fine-grained spatial cues, while an Attention-Augmented LSTM (AA-LSTM) encodes temporal dynamics, together producing rich spatio-temporal features. A novel Adaptive Dynamic Fitness Sharing and Attention (ADFSA) mechanism, embedded in a genetic algorithm (GA), selects a compact, diverse feature subset (e.g., reducing 128 features to 7) while balancing accuracy, redundancy, uniqueness, and complexity. The result is state-of-the-art performance (up to $99.65\%$ accuracy) with significantly reduced inference time. The approach enables robust, edge-friendly HAR, generalizes well under occlusion and in cluttered environments, and offers a scalable path to deployment on resource-constrained devices.
Abstract
Real-time Human Activity Recognition (HAR) has wide-ranging applications in areas such as context-aware environments, public safety, assistive technologies, and autonomous monitoring and surveillance systems. However, existing real-time HAR systems face significant challenges, including limited scalability and high computational costs arising from redundant features. To address these issues, the Inception-V3 model is customized with region-based and boundary-aware operations, implemented with average pooling and max pooling, respectively, to enhance region homogeneity, suppress noise, and capture discriminative local features, while improving robustness through down-sampling. Furthermore, to encode motion dynamics effectively, an Attention-Augmented Long Short-Term Memory (AA-LSTM) network is employed to learn temporal dependencies across video frames. The features extracted from the video dataset are then optimized with a novel dynamic composite feature selection method called Adaptive Dynamic Fitness Sharing and Attention (ADFSA). The ADFSA mechanism is embedded within a genetic algorithm to select a compact, optimized subset of features by dynamically balancing multiple objectives: accuracy, redundancy reduction, feature uniqueness, and complexity minimization. As a result, the selected subset of diverse and discriminative features enables lightweight machine learning classifiers to achieve accurate and robust HAR in heterogeneous environments. Experimental results demonstrate up to 99.65\% accuracy using as few as seven selected features, with improved inference time, on the challenging UCF-YouTube dataset, which includes factors such as occlusion, cluttered backgrounds, complex motion dynamics, and poor illumination conditions.
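To make the feature selection idea concrete, the following is a minimal sketch of a genetic algorithm with classic fitness sharing over binary feature masks. It is not the ADFSA method itself: the attention component and the dynamic adaptation of the objective balance are omitted, and every specific here is an assumption made for illustration, including the nearest-centroid accuracy proxy, the penalty weights `w_red` and `w_size`, the Hamming-distance niche radius `sigma`, and all function names.

```python
import numpy as np

def nearest_centroid_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier on the selected columns."""
    Xs = X[:, mask]
    classes = np.unique(y)
    centroids = np.stack([Xs[y == c].mean(axis=0) for c in classes])
    dists = ((Xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dists.argmin(axis=1)] == y).mean())

def raw_fitness(X, y, mask, corr, w_red=0.2, w_size=0.2):
    """Accuracy minus redundancy and subset-size penalties (illustrative weights)."""
    k = int(mask.sum())
    if k == 0:
        return 1e-6  # an empty subset cannot classify anything
    acc = nearest_centroid_accuracy(X, y, mask)
    if k > 1:
        sub = corr[np.ix_(mask, mask)]
        # mean absolute correlation over off-diagonal pairs (diagonal sums to k)
        redundancy = (sub.sum() - k) / (k * (k - 1))
    else:
        redundancy = 0.0
    size = k / mask.size
    # keep fitness strictly positive so that dividing by the niche count is meaningful
    return max(acc - w_red * redundancy - w_size * size, 1e-6)

def shared_fitness(pop, raw, sigma=0.3):
    """Divide each raw fitness by its niche count (triangular sharing kernel)."""
    # normalized Hamming distance between every pair of masks
    diff = (pop[:, None, :] != pop[None, :, :]).mean(axis=2)
    sharing = np.where(diff < sigma, 1.0 - diff / sigma, 0.0)
    niche_count = sharing.sum(axis=1)  # at least 1, since self-distance is 0
    return raw / niche_count

def select_features(X, y, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Return the best mask seen (by raw fitness) and its raw fitness."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    pop = rng.random((pop_size, n)) < 0.5
    best_mask, best_raw = None, -np.inf
    for _ in range(generations):
        raw = np.array([raw_fitness(X, y, m, corr) for m in pop])
        i = int(raw.argmax())
        if raw[i] > best_raw:
            best_mask, best_raw = pop[i].copy(), float(raw[i])
        shared = shared_fitness(pop, raw)
        # binary tournament selection on shared fitness, which favors sparse niches
        a, b = rng.integers(0, pop_size, (2, pop_size))
        parents = pop[np.where(shared[a] >= shared[b], a, b)]
        # uniform crossover with a randomly permuted partner
        partners = parents[rng.permutation(pop_size)]
        cross = rng.random((pop_size, n)) < 0.5
        children = np.where(cross, parents, partners)
        # bit-flip mutation
        children ^= rng.random((pop_size, n)) < p_mut
        pop = children
    return best_mask, best_raw

# Synthetic demo: feature 0 is informative, feature 1 is a near-duplicate of it.
demo_rng = np.random.default_rng(1)
y_demo = demo_rng.integers(0, 2, 200)
X_demo = demo_rng.normal(size=(200, 16))
X_demo[:, 0] += 3.0 * y_demo
X_demo[:, 1] = X_demo[:, 0] + 0.01 * demo_rng.normal(size=200)
mask, score = select_features(X_demo, y_demo)
print("selected feature indices:", np.flatnonzero(mask), "raw fitness:", round(score, 3))
```

Selection operates on the shared fitness, so individuals crowded into the same region of mask space are penalized and the population keeps exploring distinct subsets, while the best mask is tracked by raw fitness so that sharing does not distort the final choice. In a setting like the one described above, `X` would hold the spatio-temporal features produced by the deep network and the accuracy proxy would be replaced by the chosen lightweight classifier evaluated on held-out data.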
