Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition
Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
TL;DR
Dynamic hand gesture recognition must cope with variation in hand pose, scale, and shape. The authors introduce MsMHA-VTN, a video transformer that uses pyramidal, multiscaled multi-head attention to learn multiscale spatio-temporal features on top of frame-level backbones such as ResNet-18. They also explore decision-level (late) multimodal fusion across color, depth, infrared, surface normals, and optical flow, achieving state-of-the-art accuracy on NVGesture (88.22%) and Briareo (99.10%). The experiments support the value of both the multiscale heads and the late-fusion strategy, and the approach has practical potential for real-time, multisensor hand-gesture understanding in dynamic interaction scenarios.
Abstract
Dynamic gesture recognition is a challenging research area because of variations in the pose, size, and shape of the signer's hand. In this letter, a Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN) for dynamic hand gesture recognition is proposed. A pyramidal hierarchy of multiscale features is extracted using the transformer's multiscaled head attention model. The proposed model employs a different attention dimension for each head of the transformer, which enables it to attend at multiple scales. Further, in addition to single-modality recognition, performance using multiple modalities is examined. Extensive experiments demonstrate the superior performance of the proposed MsMHA-VTN, with overall accuracies of 88.22% and 99.10% on the NVGesture and Briareo datasets, respectively.
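
The idea of giving each attention head its own dimension can be illustrated with a minimal sketch. It is not the authors' implementation: the head dimensions (8, 4, 2), the random projection weights, the concatenation of head outputs, and all function names are assumptions chosen for illustration, and the input stands in for frame-level features that a backbone such as ResNet-18 would produce.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights (illustrative only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multiscale_attention(tokens, head_dims, rng):
    """Self-attention in which head i projects to its own width head_dims[i].

    tokens: list of T feature vectors, each of length d_model.
    Returns T vectors of length sum(head_dims) (head outputs concatenated).
    """
    d_model = len(tokens[0])
    head_outputs = []
    for d_h in head_dims:
        # Each head has query/key/value projections of a different width.
        w_q = random_matrix(d_model, d_h, rng)
        w_k = random_matrix(d_model, d_h, rng)
        w_v = random_matrix(d_model, d_h, rng)
        q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        # Scaled dot-product attention, scaled by this head's own dimension.
        weights = [softmax([s / math.sqrt(d_h) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))
    # Concatenate the per-head outputs along the feature axis.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(tokens))]


rng = random.Random(0)
frames = random_matrix(4, 16, rng)  # 4 frames, 16 features each (hypothetical sizes)
out = multiscale_attention(frames, head_dims=[8, 4, 2], rng=rng)
print(len(out), len(out[0]))  # 4 14
```

Because the heads operate at widths 8, 4, and 2 in this sketch, each frame ends up with a 14-dimensional representation that mixes coarse and fine attention subspaces, in contrast to standard multi-head attention, where every head shares the same dimension.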
