Hierarchical Windowed Graph Attention Network and a Large Scale Dataset for Isolated Indian Sign Language Recognition

Suvajit Patra, Arkadip Maitra, Megha Tiwari, K. Kumaran, Swathy Prabhu, Swami Punyeshwarananda, Soumitra Samanta

TL;DR

This paper introduces a large-scale isolated Indian Sign Language (ISL) dataset and a novel sign language (SL) recognition model, the Hierarchical Windowed Graph Attention Network (HWGAT), which is built on the human upper-body skeleton graph.

Abstract

Automatic Sign Language (SL) recognition is an important task in the computer vision community. Building a robust SL recognition system requires a considerable amount of data, which is lacking, particularly for Indian Sign Language (ISL). In this paper, we introduce a large-scale isolated ISL dataset and a novel SL recognition model based on the skeleton graph structure. The dataset covers 2002 common words used daily in the deaf community, recorded by 20 deaf adult signers (10 male and 10 female), and contains 40033 videos. We propose an SL recognition model, the Hierarchical Windowed Graph Attention Network (HWGAT), which utilizes the human upper-body skeleton graph. HWGAT tries to capture distinctive motions by giving attention to different body parts induced by the human skeleton graph. The utility of the proposed dataset and the usefulness of our model are evaluated through extensive experiments. We pre-trained the proposed model on the presented dataset and fine-tuned it on different sign language datasets, improving performance by 1.10, 0.46, 0.78, and 6.84 percentage points on INCLUDE, LSA64, AUTSL, and WLASL, respectively, compared to the existing state-of-the-art keypoint-based models.

Paper Structure

This paper contains 17 sections, 2 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: A sample frame of the sign Hello in all the views and modalities ((a) left (60fps), (b) front (60fps), (c) right (60fps), (d) Azure Kinect DK depth (30fps) and (e) Azure Kinect DK RGB (30fps)) available in the dataset.
  • Figure 2: Sample frames from the signs Hello and World.
  • Figure 3: The proposed Hierarchical Windowed Graph Attention Network (HWGAT) takes the spatio-temporal graph structure as input and divides this graph into multiple spatial windows based on distinct body parts, as shown in Figure 4. Next, multiple part attention layers are applied to this windowed graph structure to extract features, and a fully connected layer is used to predict the sign word.
  • Figure 4: Grouping of keypoints according to the $5$ body parts $P_1$ to $P_5$. $P_1$ contains the right-hand keypoints, $P_2$ the right arm keypoints, $P_3$ the facial keypoints, $P_4$ the left arm keypoints, and $P_5$ the left-hand keypoints. Combinations of these parts are used to create the $4$ spatial windows.
  • Figure 5: Visual representation of the spatial graph using 27 keypoints (10 per hand and 7 pose points).
  • ...and 4 more figures
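
The part grouping and spatial windows described in the captions of Figures 3 to 5 can be sketched roughly as follows. The captions give only the counts (27 keypoints, 10 per hand and 7 pose points, 5 parts, 4 windows); they do not say which keypoint index belongs to which part or how parts are combined into windows. The index layout, the 3/1/3 split of the pose points, and the rule of joining adjacent parts are therefore assumptions made for illustration, and the names `PARTS`, `make_windows`, and `window_mask` are hypothetical.

```python
# Hypothetical keypoint layout: the indices and the part membership are
# assumptions chosen only to match the counts in the captions
# (10 per hand + 7 pose points = 27 keypoints).
RIGHT_HAND = list(range(0, 10))   # P1: 10 right-hand keypoints
RIGHT_ARM = [10, 11, 12]          # P2: assumed 3 right arm pose points
FACE = [13]                       # P3: assumed 1 facial pose point
LEFT_ARM = [14, 15, 16]           # P4: assumed 3 left arm pose points
LEFT_HAND = list(range(17, 27))   # P5: 10 left-hand keypoints

PARTS = [RIGHT_HAND, RIGHT_ARM, FACE, LEFT_ARM, LEFT_HAND]


def make_windows(parts):
    """Join each pair of adjacent parts into one spatial window.

    Assumed rule: 5 parts give 4 windows (P1+P2, P2+P3, P3+P4, P4+P5).
    """
    return [parts[i] + parts[i + 1] for i in range(len(parts) - 1)]


def window_mask(window, num_keypoints):
    """Boolean mask whose entry [i][j] is True when i and j share the window.

    Attention restricted by such a mask only relates keypoints that fall
    inside the same spatial window.
    """
    members = set(window)
    return [
        [i in members and j in members for j in range(num_keypoints)]
        for i in range(num_keypoints)
    ]


NUM_KEYPOINTS = sum(len(part) for part in PARTS)
WINDOWS = make_windows(PARTS)
MASKS = [window_mask(window, NUM_KEYPOINTS) for window in WINDOWS]

print(NUM_KEYPOINTS, len(WINDOWS), [len(w) for w in WINDOWS])
```

Under these assumptions, each mask could be applied to the attention scores of one window so that, for example, right-hand keypoints attend to right arm keypoints but not to the left hand. The grouping actually used by HWGAT may differ.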