Word-level Sign Language Recognition with Multi-stream Neural Networks Focusing on Local Regions and Skeletal Information

Mizuki Maruyama, Shrey Singh, Katsufumi Inoue, Partha Pratim Roy, Masakazu Iwamura, Michifumi Yoshioka

TL;DR

This paper proposes a novel WSLR method that exploits information specifically useful for the WSLR problem. It is realized as a multi-stream neural network (MSNN), which consists of three streams: 1) a base stream, 2) a local image stream, and 3) a skeleton stream.

Abstract

Word-level sign language recognition (WSLR) has attracted attention because it is expected to overcome the communication barrier between people with speech impairment and those who can hear. In the WSLR problem, a method designed for action recognition has achieved state-of-the-art accuracy. Indeed, it sounds reasonable for an action recognition method to perform well on WSLR because sign language is regarded as an action. However, a careful evaluation reveals that the tasks of action recognition and WSLR are inherently different. Hence, in this paper, we propose a novel WSLR method that takes into account information specifically useful for the WSLR problem. We realize it as a multi-stream neural network (MSNN), which consists of three streams: 1) a base stream, 2) a local image stream, and 3) a skeleton stream. Each stream is designed to handle a different type of information. The base stream deals with quick and detailed movements of the hands and body, the local image stream focuses on handshapes and facial expressions, and the skeleton stream captures the relative positions of the body and both hands. This approach allows us to combine various types of data for more comprehensive gesture analysis. Experimental results on the WLASL and MS-ASL datasets show the effectiveness of the proposed method; it achieved an improvement of approximately 10% to 15% in Top-1 accuracy when compared with conventional methods.

Paper Structure

This paper contains 16 sections, 3 equations, 15 figures, 4 tables, 1 algorithm.

Figures (15)

  • Figure 1: Overview of the proposed method. The proposed multi-stream neural network (MSNN) consists of three streams: 1) a base stream, 2) a local image stream, and 3) a skeleton stream. Each stream is trained separately, and the recognition scores produced by the streams are averaged to obtain the final recognition result.
  • Figure 2:
  • Figure 3:
  • Figure 5: Example of bounding box extraction for the face and both hands. These bounding boxes are extracted based on the skeletal points detected with OpenPose: both hand regions are based on shoulder point $S$, elbow point $E$, and wrist point $W$, and the face region is based on the left ear point $L$ and right ear point $R$. (The images are taken from the WLASL dataset li2020word.)
  • Figure 6: The 27 keypoints input to ST-GCN in the skeleton stream. Five keypoints belong to the body, and 11 keypoints belong to each hand. (The left image is from vaezi2019ms-asl.)
  • ...and 10 more figures
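
The score averaging described in the Figure 1 caption can be sketched as a simple late-fusion step. This is a minimal illustration, not the authors' implementation: the function names (`softmax`, `fuse_scores`, `predict`) are hypothetical, and it assumes each stream outputs one raw score per class that is normalized with a softmax before averaging, a detail the caption does not specify.

```python
import math


def softmax(logits):
    """Convert raw per-class scores into probabilities (assumed normalization)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(stream_scores):
    """Average per-class scores across streams (equal weight per stream)."""
    n_classes = len(stream_scores[0])
    if any(len(scores) != n_classes for scores in stream_scores):
        raise ValueError("all streams must score the same set of classes")
    n_streams = len(stream_scores)
    return [
        sum(scores[c] for scores in stream_scores) / n_streams
        for c in range(n_classes)
    ]


def predict(stream_logits):
    """Return (predicted class index, fused scores) for one video."""
    fused = fuse_scores([softmax(logits) for logits in stream_logits])
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Hypothetical outputs of the base, local image, and skeleton streams
# for a three-word vocabulary.
example_logits = [
    [2.0, 1.0, 0.0],  # base stream
    [0.0, 3.0, 0.0],  # local image stream
    [0.5, 2.5, 0.1],  # skeleton stream
]
label, fused = predict(example_logits)
```

In this made-up example the base stream alone would pick class 0, while the averaged scores pick class 1, which shows how the other two streams can override a single stream's vote.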