Enhancing Brazilian Sign Language Recognition through Skeleton Image Representation

Carlos Eduardo G. R. Alves; Francisco de Assis Boldt; Thiago M. Paixão

TL;DR

This work tackles Isolated Sign Language Recognition (ISLR) for LIBRAS by transforming frame-level body, hand, and face landmarks into a single skeleton image, which is then classified by a 2-D CNN. The method uses OpenPose for landmark extraction and Skeleton-DML for image encoding, relying solely on RGB input and a lightweight ResNet18-based architecture. It achieves state-of-the-art results on MINDS-Libras ($0.93$) and LIBRAS-UFOP ($0.82$) while being simpler and easier to train than multimodal 3-D CNN baselines. A key limitation is the time OpenPose takes to extract landmarks, which points to future work on faster pose estimators and alternative encodings to enable real-time applications.

Abstract

Effective communication is paramount for the inclusion of deaf individuals in society. However, persistent communication barriers caused by limited Sign Language (SL) knowledge hinder their full participation. In this context, Sign Language Recognition (SLR) systems have been developed to improve communication between signing and non-signing individuals. Of particular relevance is the recognition of isolated signs (Isolated Sign Language Recognition, ISLR), which underpins vision-based SL search engines, learning tools, and translation systems. This work proposes an ISLR approach in which body, hand, and facial landmarks are extracted over time and encoded as 2-D images. These images are processed by a convolutional neural network, which maps the visual-temporal information to a sign label. Experimental results show that our method surpasses the state of the art on two widely recognized Brazilian Sign Language (LIBRAS) datasets, the primary focus of this study. In addition to being more accurate, our method is more time-efficient and easier to train because it relies on a simpler network architecture and on RGB data alone as input.
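
To make the idea of encoding landmarks "as 2-D images" concrete, the following is a minimal sketch under stated assumptions, not the Skeleton-DML encoding the paper uses. It assumes each frame provides the same number of (x, y) landmarks, places landmarks on rows and frames on columns, and stores min-max-scaled x and y in the red and green channels. The function name `encode_skeleton_image` and the channel layout are hypothetical choices made only for illustration.

```python
from typing import List, Tuple

Landmark = Tuple[float, float]   # (x, y) in pixel coordinates
Pixel = Tuple[int, int, int]     # (r, g, b), each in 0..255


def _scale(value: float, lo: float, hi: float) -> int:
    """Min-max scale a coordinate to the 0..255 range."""
    if hi == lo:                 # degenerate case: all values identical
        return 0
    return round(255 * (value - lo) / (hi - lo))


def encode_skeleton_image(frames: List[List[Landmark]]) -> List[List[Pixel]]:
    """Encode a landmark sequence as an image.

    Illustrative layout (an assumption, not the paper's encoding):
    image[j][t] holds landmark j at frame t, with x in the red channel,
    y in the green channel, and the blue channel left at zero.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    n_landmarks = len(frames[0])
    if any(len(frame) != n_landmarks for frame in frames):
        raise ValueError("every frame must have the same number of landmarks")

    xs = [x for frame in frames for x, _ in frame]
    ys = [y for frame in frames for _, y in frame]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)

    image: List[List[Pixel]] = []
    for j in range(n_landmarks):          # one row per landmark
        row: List[Pixel] = []
        for frame in frames:              # one column per frame (time axis)
            x, y = frame[j]
            row.append((_scale(x, x_lo, x_hi), _scale(y, y_lo, y_hi), 0))
        image.append(row)
    return image
```

With this layout, a video of T frames and 126 landmarks yields a 126 × T image, so motion over time appears as horizontal variation that a 2-D CNN can pick up with ordinary convolutions.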

Paper Structure (19 sections, 4 figures, 3 tables)

Introduction
Proposed ISLR Method
Landmarks Extraction
Image Encoding
Training the Model
Experimental Methodology
Datasets
MINDS-Libras Dataset
LIBRAS-UFOP Dataset
Performance Metrics
Experimental Procedure
Ablation Study
Comparative Evaluation
Experimental Platform
Results and Discussion
...and 4 more sections

Figures (4)

Figure 1: Overview of the proposed ISLR method. Initially, landmarks are extracted from individual frames of the input video sequence. Then, the landmarks are converted into a single 2-D image that encodes spatial and temporal information. Finally, the image is fed into a CNN model for sign classification.
Figure 2: Extraction of landmarks from raw RGB frames using OpenPose (Cao et al., 2019). All landmarks above the hip (126 in total) were used: 42 hand landmarks (green dots), 70 face landmarks (yellow dots), and 14 body landmarks (red 'x'). Face points indicated by 'x' markers are treated as body landmarks by OpenPose.
Figure 3: Distribution of video sequence length per sign for the evaluation datasets.
Figure 4: Confusion matrix for the MINDS-Libras dataset.
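
The landmark counts in the Figure 2 caption (14 body, 70 face, 42 hand) can be checked with a short sketch that assembles one frame's 126-point vector. This is an illustration only: the helper `assemble_landmarks`, the concatenation order, and the assumption that the 14 above-hip body points have already been selected upstream are hypothetical and not taken from the paper. The 21-per-hand split follows from the 42 hand landmarks covering two hands.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates

# Counts stated in the Figure 2 caption.
N_BODY, N_FACE, N_HANDS = 14, 70, 42
N_PER_HAND = N_HANDS // 2            # two hands, 21 landmarks each
N_TOTAL = N_BODY + N_FACE + N_HANDS  # 126


def assemble_landmarks(
    body: Sequence[Point],
    face: Sequence[Point],
    left_hand: Sequence[Point],
    right_hand: Sequence[Point],
) -> List[Point]:
    """Concatenate per-group landmarks into a single per-frame vector.

    The order (body, face, left hand, right hand) is an assumption made
    for this sketch; `body` is expected to already contain only the 14
    above-hip points.
    """
    groups = [
        ("body", body, N_BODY),
        ("face", face, N_FACE),
        ("left_hand", left_hand, N_PER_HAND),
        ("right_hand", right_hand, N_PER_HAND),
    ]
    for name, points, expected in groups:
        if len(points) != expected:
            raise ValueError(
                f"{name}: expected {expected} landmarks, got {len(points)}"
            )
    return [*body, *face, *left_hand, *right_hand]
```

Validating the counts per group, rather than only the total, catches the case where a detector returns a missing hand while another group happens to be oversized.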
