MVTN: A Multiscale Video Transformer Network for Hand Gesture Recognition

Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan

TL;DR

This work tackles robust dynamic hand gesture recognition from video by addressing scale and pose variation that challenge traditional transformers. It introduces MVTN, a convolution-free multiscale video transformer that builds a pyramid of attention across six stages through linear projections, enabling learning of both high- and low-resolution features with reduced computation. The model further uses Spatial-Reduction Attention to cut complexity and applies a late fusion scheme to leverage multiple sensing modalities (e.g., RGB, depth, IR, normals). Experiments on NVGesture and Briareo demonstrate state-of-the-art accuracy with fewer parameters, including a best NVGesture performance of 87.80% with triple-modal input and Briareo reaching up to 98.61% with multi-modal fusion, highlighting practical gains for multimodal hand gesture recognition.

Abstract

In this paper, we introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition. Variation in the size, pose, and shape of the hand is a central challenge in hand gesture recognition, and multiscale features help capture it. The proposed model incorporates a multiscale feature hierarchy to capture diverse levels of detail and context within hand gestures. This hierarchy is obtained by extracting attention of different dimensions in different transformer stages: the initial stages model high-resolution features and the later stages model low-resolution features. Our approach also leverages multimodal data, using depth maps, infrared data, and surface normals along with RGB images from the NVGesture and Briareo datasets. Experiments show that the proposed MVTN achieves state-of-the-art results with lower computational complexity and fewer parameters. The source code is available at https://github.com/mallikagarg/MVTN.

Paper Structure (16 sections, 3 equations, 4 figures, 7 tables)


Figures (4)

  • Figure 1: Comparison of different transformer models, where TF-E denotes a Transformer encoder. a) ViT [dosovitskiy2020image], which keeps the same attention dimension across all stages of the transformer block (columnar structure). The other ViT variants have a hierarchical structure, in which a pyramid of features is learned using: b) a pooling operator, MViT [fan2021multiscale]; c) a convolutional projection, CvT [wu2021cvt]; and d) a linear projection, which is used in our model (MVTN) to progressively shrink the extracted features and take advantage of scaling.
  • Figure 2: The overall architecture of the proposed Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition. The transformer stage blocks are drawn at different sizes to show the progressive reduction in the dimension of the attention vector with each stage. The proposed model captures multiscale contextual information of the hand gesture, which helps to tackle the major challenges of hand shape and size variation.
  • Figure 3: The proposed Multiscale Attention learns a pyramid hierarchy with 6 transformer stages (T1-T6), where $N = B\times T \times D$ is the dimension of attention in the first stage. For convenience, pyramid scaling at each stage with a factor of $N/2$ from the previous stage is shown only for three stages.
  • Figure 4: a) Multi-head attention (MHA) [vaswani2017attention], b) Spatial-Reduction Attention (SRA) [wang2021pyramid], c) the proposed spatial-reduction attention. In c), the query (Q), key (K), and value (V) are all spatially reduced using a linear projection, whereas in b) only the key and value are reduced. This spatial reduction incorporates a multiscale pyramid structure and also reduces the computational cost of the model.
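
The captions for Figures 3 and 4 describe attention in which Q, K, and V are all reduced by linear projections, with the attention dimension shrinking from stage to stage. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes a single attention head, a single clip of T frame tokens (no batch axis), random untrained weights, and a first stage that keeps the input width D followed by five stages that each halve it. The names `reduced_attention`, `stage_dims`, and `run_pyramid` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reduced_attention(x, w_q, w_k, w_v):
    """Single-head attention where Q, K and V are ALL linearly projected
    to the (smaller) output width, so the output is narrower than x.

    x:   (T, d_in) frame tokens
    w_*: (d_in, d_out) projection matrices (assumed shapes)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # each (T, d_out)
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (T, T) scaled dot products
    return softmax(scores, axis=-1) @ v           # (T, d_out)


def stage_dims(d, num_stages=6):
    """Assumed schedule: stage 1 keeps width d, each later stage halves it."""
    return [d // (2 ** s) for s in range(num_stages)]


def run_pyramid(x, num_stages=6, seed=0):
    """Pass tokens through a stack of reduced-attention stages.

    Weights are random placeholders; a real model would learn them and
    would also include normalization, residual paths and feed-forward layers.
    """
    rng = np.random.default_rng(seed)
    dims = stage_dims(x.shape[-1], num_stages)
    for d_out in dims:
        d_in = x.shape[-1]
        w_q, w_k, w_v = (
            rng.standard_normal((d_in, d_out)) / np.sqrt(d_in) for _ in range(3)
        )
        x = reduced_attention(x, w_q, w_k, w_v)
    return x, dims


if __name__ == "__main__":
    tokens = np.random.default_rng(1).standard_normal((40, 512))  # T=40, D=512
    out, dims = run_pyramid(tokens)
    print("stage widths:", dims)
    print("output shape:", out.shape)
```

Under these assumptions the cost of each projection falls with the stage width, while the (T, T) attention map keeps the same size, since only the feature dimension is reduced and the number of frame tokens is left unchanged.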