Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition

Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan

TL;DR

Dynamic hand gesture recognition contends with pose, scale, and shape variation. The authors introduce MsMHA-VTN, a video transformer that leverages a pyramidal, multiscaled multi-head attention to learn multiscale spatio-temporal features from frame-level backbones such as ResNet-18. They further explore decision-level multimodal fusion across color, depth, infrared, normals, and optical flow, achieving state-of-the-art results on NVGesture (88.22%) and Briareo (99.10%). Extensive experiments validate the efficacy of the multiscale heads and late fusion strategy for robust gesture recognition. The approach holds practical potential for real-time, multisensor hand-gesture understanding in dynamic interaction scenarios.

Abstract

Dynamic gesture recognition is one of the challenging research areas due to variations in pose, size, and shape of the signer's hand. In this letter, Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN) for dynamic hand gesture recognition is proposed. A pyramidal hierarchy of multiscale features is extracted using the transformer multiscaled head attention model. The proposed model employs different attention dimensions for each head of the transformer which enables it to provide attention at the multiscale level. Further, in addition to single modality, recognition performance using multiple modalities is examined. Extensive experiments demonstrate the superior performance of the proposed MsMHA-VTN with an overall accuracy of 88.22% and 99.10% on NVGesture and Briareo datasets, respectively.
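
The decision-level (late) fusion mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes each modality's network outputs a class-probability vector and that fusion is a plain average of those vectors; the summary here does not state the paper's exact fusion rule, so the averaging, the function name `late_fusion`, and the example scores are all hypothetical.

```python
def late_fusion(prob_by_modality):
    """Average per-modality class probabilities (assumed fusion rule).

    prob_by_modality maps a modality name to a list of class probabilities.
    Returns (predicted_class_index, fused_probabilities).
    """
    vectors = list(prob_by_modality.values())
    n_classes = len(vectors[0])
    # Element-wise mean over modalities for every class.
    fused = [sum(v[c] for v in vectors) / len(vectors) for c in range(n_classes)]
    predicted = max(range(n_classes), key=fused.__getitem__)
    return predicted, fused


# Hypothetical scores for three gesture classes from three modalities.
scores = {
    "color": [0.6, 0.3, 0.1],
    "depth": [0.2, 0.7, 0.1],
    "infrared": [0.3, 0.5, 0.2],
}
predicted, fused = late_fusion(scores)
print(predicted, fused)
```

Because each modality is classified independently and only the decisions are combined, a modality can be added or dropped without retraining the others, which is the usual motivation for fusing at this level.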

Paper Structure

This paper contains 10 sections, 6 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: The proposed Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN) with 6 transformer stages (T1-T6) and 8 heads, where $N=L\times D$ is the dimension of the first head in each attention vector, and $D$ and $L$ are the dimensions of the input tensor. For convenience, pyramid scaling by a factor of $1/2$ at each head is shown for only three heads.
  • Figure 2: The proposed Multiscaled Multi-Head pyramid attention. Pyramid attention is an attention mechanism that sets the lengths of the query, key, and value in a pyramid pattern, which allows the model to exploit multiscale information at different heads. Here, the size of the linear block varies across heads to show that $d_k$ decreases gradually.
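
The pyramid pattern described in the two captions can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the head width $d_k$ starts at a given first dimension and is halved at each subsequent head (the $1/2$ factor from Figure 1), the projection weights are random, and the head outputs are simply concatenated. The helper names `pyramid_head_dims`, `head_attention`, and `multiscale_attention` are hypothetical.

```python
import math
import random


def pyramid_head_dims(first_dim, num_heads, factor=0.5):
    """Per-head width d_k, shrinking by `factor` from one head to the next."""
    dims, d = [], float(first_dim)
    for _ in range(num_heads):
        dims.append(max(1, int(d)))  # assumption: never drop below width 1
        d *= factor
    return dims


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention(x, d_k, rng):
    """Scaled dot-product self-attention for one head of width d_k."""
    d_model = len(x[0])

    def proj():  # random projection, a stand-in for learned weights
        scale = 1.0 / math.sqrt(d_model)
        return [[rng.gauss(0.0, scale) for _ in range(d_k)] for _ in range(d_model)]

    q, k, v = matmul(x, proj()), matmul(x, proj()), matmul(x, proj())
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multiscale_attention(x, first_dim, num_heads, seed=0):
    """Run every head at its own width and concatenate the outputs per frame."""
    rng = random.Random(seed)
    heads = [head_attention(x, d_k, rng) for d_k in pyramid_head_dims(first_dim, num_heads)]
    return [sum((h[t] for h in heads), []) for t in range(len(x))]


print(pyramid_head_dims(512, 8))  # [512, 256, 128, 64, 32, 16, 8, 4]
frames = [[float(i + j) for j in range(8)] for i in range(4)]  # 4 frames, 8 features
out = multiscale_attention(frames, first_dim=8, num_heads=3)
print(len(out), len(out[0]))  # 4 14
```

In standard multi-head attention every head has the same width, so the only change sketched here is the per-head $d_k$. A real model would normally follow the concatenation with a learned output projection, which is omitted.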