Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition
Novanto Yudistira
TL;DR
The paper addresses the challenge of robust human action recognition (HAR) in active and assisted living (AAL) by proposing adaptive fusion across multimodal data streams (vision, audio, sensors, text) using gating networks. It integrates modern components such as large language models and multimodal LLMs as contextual reasoning components and fusion adapters, and validates the approach on action recognition, violence detection, and self-supervised learning tasks. Key contributions include a gating-based multi-stream framework with softmax-normalized modality weights, empirical demonstrations of high accuracy (e.g., 91% on action recognition with RGB plus optical flow (RGB+OF) and 90.5% on violence detection), and a discussion of omnidirectional cameras for 360-degree HAR with privacy safeguards. The work highlights practical implications for real-time, context-aware assistive systems and outlines challenges and future directions in privacy, efficiency, and dataset standardization for real-world deployment.
Abstract
This study presents a methodology for human action recognition that combines deep neural networks with adaptive fusion across multiple modalities, including RGB, optical flow, audio, and depth. We use gating mechanisms for multimodal fusion to overcome the limitations of traditional unimodal recognition and to open up a wider range of applications. By investigating gating mechanisms and adaptive weighting-based fusion architectures, the methodology selectively integrates relevant information from each modality, improving both accuracy and robustness in action recognition. We examine several gated fusion strategies to identify the most effective approach for multimodal action recognition and show that it outperforms conventional unimodal methods. Gating mechanisms help extract the most informative features, yielding a more complete representation of actions and substantial gains in recognition performance. Evaluations on benchmark datasets for human action recognition, violence detection, and multiple self-supervised learning tasks show promising gains in accuracy. The research is significant for its potential to improve action recognition systems across diverse fields: fusing multimodal information supports more sophisticated applications in surveillance and human-computer interaction, especially in active and assisted living.
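To make the idea of gated, adaptively weighted fusion concrete, the following is a minimal sketch rather than the paper's implementation. It assumes each modality stream has already produced a class-probability vector, and that a gating network has produced one scalar logit per modality; in this sketch those logits are passed in by hand. The names softmax, gated_fusion, stream_probs, and gate_logits are hypothetical and chosen only for illustration. The gate logits are normalized with a softmax so the modality weights are positive and sum to one, and the fused prediction is the weighted sum of the per-stream predictions.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(
    stream_probs: Dict[str, List[float]],
    gate_logits: Dict[str, float],
) -> List[float]:
    """Fuse per-modality class probabilities with softmax-normalized weights.

    stream_probs: modality name -> class-probability vector (same length each).
    gate_logits:  modality name -> scalar gate score. Assumption: in a real
                  system a gating network would predict these from the input;
                  here they are supplied directly.
    """
    names = sorted(stream_probs)  # fixed order so weights align with streams
    weights = softmax([gate_logits[n] for n in names])
    num_classes = len(stream_probs[names[0]])
    fused = [0.0] * num_classes
    for w, name in zip(weights, names):
        for c, p in enumerate(stream_probs[name]):
            fused[c] += w * p
    return fused


# Illustrative (made-up) predictions from two streams over three classes.
streams = {
    "rgb": [0.7, 0.2, 0.1],
    "flow": [0.2, 0.6, 0.2],
}

# Equal gate logits reduce to plain averaging of the two streams.
fused_equal = gated_fusion(streams, {"rgb": 0.0, "flow": 0.0})

# A gate that trusts optical flow more shifts the decision toward its prediction.
fused_flow = gated_fusion(streams, {"rgb": -2.0, "flow": 2.0})
```

With equal logits the fusion is a simple average, which is the behavior of fixed late fusion; the adaptive part comes from letting the gate logits vary per input, so that a noisy or uninformative modality can be down-weighted for a given sample.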
