Continuous Sign Language Recognition Based on Motor attention mechanism and frame-level Self-distillation

Qidan Zhu, Jing Li, Fei Yuan, Quan Gan

TL;DR

This paper proposes a novel motor attention mechanism that captures the distorted changes in local motion regions during sign language expression and yields a dynamic representation of image changes. It also applies self-distillation to frame-level feature extraction for continuous sign language, improving feature expression without increasing computational resources.

Abstract

Changes in facial expression, head movement, body movement, and gesture movement are important cues in sign language recognition. However, most current continuous sign language recognition (CSLR) methods focus on static images in video sequences at the frame-level feature extraction stage and ignore the dynamic changes in the images. In this paper, we propose a novel motor attention mechanism that captures the distorted changes in local motion regions during sign language expression and obtains a dynamic representation of image changes. We also apply, for the first time, the self-distillation method to frame-level feature extraction for continuous sign language: features of adjacent stages are self-distilled, with the higher-order features acting as teachers that guide the lower-order features, which improves feature expression without increasing computational resources. Combining the two yields our proposed holistic model, CSLR based on Motor Attention Mechanism and Frame-level Self-Distillation (MAM-FSD), which improves the inference ability and robustness of the model. We conduct experiments on three publicly available datasets, and the results show that the proposed method effectively extracts sign language motion information from videos, improves CSLR accuracy, and reaches the state-of-the-art level.
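
The adjacent-stage self-distillation described in the abstract can be illustrated with a small sketch. The abstract does not state the loss function or how features of different stages are aligned, so the mean-squared-error loss, the assumption that all stage features share one shape, and the function name `adjacent_stage_distillation_loss` are illustrative choices here, not the authors' implementation.

```python
import numpy as np

def adjacent_stage_distillation_loss(stage_features):
    """Average MSE between each stage's features and the next stage's.

    stage_features: list of arrays, ordered from lowest to highest stage.
    Assumption: every array has already been projected to the same shape.
    The deeper stage of each pair plays the teacher role; in a real
    training framework it would usually be treated as a constant.
    """
    if len(stage_features) < 2:
        raise ValueError("need at least two stages to distill")
    total = 0.0
    pairs = zip(stage_features[:-1], stage_features[1:])
    for student, teacher in pairs:
        student = np.asarray(student, dtype=np.float64)
        teacher = np.asarray(teacher, dtype=np.float64)
        total += float(np.mean((student - teacher) ** 2))
    return total / (len(stage_features) - 1)

# Toy features for three stages (hypothetical values).
features = [np.zeros(3), np.ones(3), 3.0 * np.ones(3)]
loss = adjacent_stage_distillation_loss(features)
```

Because the loss only compares features that the network already computes, it adds a training signal without adding parameters, which matches the abstract's claim of no extra computational resources.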

Paper Structure (15 sections, 7 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: The motion heat map obtained by the inter-frame difference method.
  • Figure 2: Overview of the proposed MAM-FSD. First, a CNN captures frame-level features; a 1DCNN+BiLSTM then performs temporal modeling; finally, a classifier predicts sentences. The proposed motor attention mechanism module and frame-level self-distillation method are placed in the frame-level feature extraction section.
  • Figure 3: Structure diagram of motor attention mechanism.
  • Figure 4: WER variation curves for RWTH validation set and test set.
  • Figure 5: WER variation curves for RWTH-T validation set and test set.
  • ...and 2 more figures
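
The inter-frame difference method named in the Figure 1 caption can be sketched as follows. This is a minimal illustration of plain frame differencing, assuming two consecutive frames given as arrays; the function name `motion_heat_map` and the normalisation to [0, 1] are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def motion_heat_map(prev_frame, next_frame):
    """Absolute difference between two frames, scaled to [0, 1]."""
    # Cast to float first so uint8 subtraction cannot wrap around.
    prev = np.asarray(prev_frame, dtype=np.float32)
    nxt = np.asarray(next_frame, dtype=np.float32)
    diff = np.abs(nxt - prev)
    if diff.ndim == 3:
        # Collapse colour channels: (H, W, C) -> (H, W).
        diff = diff.mean(axis=-1)
    peak = diff.max()
    # Identical frames give an all-zero map; avoid dividing by zero.
    return diff / peak if peak > 0 else diff

# Toy frames: a 2x2 block changes intensity between frames.
frame_a = np.zeros((4, 4), dtype=np.uint8)
frame_b = frame_a.copy()
frame_b[1:3, 1:3] = 200
heat = motion_heat_map(frame_a, frame_b)
```

In the resulting map, high values mark pixels that changed between the two frames and zeros mark static regions.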