Dynamic LIBRAS Gesture Recognition via CNN over Spatiotemporal Matrix Representation

Jasmine Moreira

Abstract

This paper proposes a method for dynamic hand gesture recognition that composes two models: the MediaPipe Hand Landmarker, which extracts 21 skeletal keypoints of the hand, and a convolutional neural network (CNN) trained to classify gestures from a 90 × 21 spatiotemporal matrix representation of those keypoints. The method is applied to recognizing LIBRAS (Brazilian Sign Language) gestures for device control in a home automation system, covering 11 classes of static and dynamic gestures. For real-time inference, a sliding window with temporal frame triplication is used, which enables continuous recognition without recurrent networks. Tests achieved 95% accuracy under low-light conditions and 92% under normal lighting. The results indicate that the approach is effective, although systematic experiments with greater user diversity are needed to evaluate generalization more thoroughly.
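
The sliding window with frame triplication mentioned above can be illustrated with a minimal Python sketch. It is not the paper's implementation, and the abstract does not specify the details, so several points are assumptions: that the 90 rows come from 30 buffered frames each repeated three times, that each keypoint carries a fixed number of coordinate channels (3 here), and the names SlidingGestureWindow, push, ready, and matrix, which are hypothetical.

```python
from collections import deque

import numpy as np

NUM_KEYPOINTS = 21   # keypoints per hand, as stated in the abstract
MATRIX_ROWS = 90     # rows of the spatiotemporal matrix, as stated in the abstract
REPEAT = 3           # "triplication": each frame repeated three times (assumed meaning)
WINDOW_FRAMES = MATRIX_ROWS // REPEAT  # 30 buffered frames (assumption)


class SlidingGestureWindow:
    """Hypothetical buffer that turns a stream of keypoint frames into a
    fixed-size matrix suitable as CNN input."""

    def __init__(self, channels: int = 3):
        # channels = coordinates kept per keypoint (assumed, e.g. x, y, z)
        self.channels = channels
        # deque with maxlen drops the oldest frame automatically
        self.frames = deque(maxlen=WINDOW_FRAMES)

    def push(self, keypoints) -> None:
        """Append one frame of shape (21, channels)."""
        kp = np.asarray(keypoints, dtype=np.float32)
        if kp.shape != (NUM_KEYPOINTS, self.channels):
            raise ValueError(f"expected {(NUM_KEYPOINTS, self.channels)}, got {kp.shape}")
        self.frames.append(kp)

    def ready(self) -> bool:
        """True once the window holds enough frames to build a full matrix."""
        return len(self.frames) == WINDOW_FRAMES

    def matrix(self) -> np.ndarray:
        """Return an array of shape (90, 21, channels), oldest frame first."""
        if not self.ready():
            raise ValueError("window is not full yet")
        stacked = np.stack(list(self.frames))      # (30, 21, channels)
        return np.repeat(stacked, REPEAT, axis=0)  # (90, 21, channels)
```

In a capture loop, each new frame's keypoints would be passed to push, and once ready returns True the result of matrix could be fed to the classifier on every frame, so the window slides by one frame per step without any recurrent state.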

Paper Structure

This paper contains 10 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: LIBRAS Fingerspelling Alphabet
  • Figure 2: MediaPipe Keypoint Scheme
  • Figure 3: Static Gesture — Letter A
  • Figure 4: Motion Matrices — "A" and Toggle (On/Off)
  • Figure 5: Second Model Architecture — Convolutional Network
  • ...and 5 more figures