KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation

Yulong Li, Bolin Ren, Ke Hu, Changyuan Liu, Zhengyong Jiang, Kang Dang, Jionglong Su

TL;DR

KD-MSLRT combines a cross-modal multi-knowledge distillation technique from 3D to 1D with a novel end-to-end pre-training text correction framework that achieves significant advancements in correcting text output errors. The paper also compares augmentation methods for landmark data, an area that has received little research.

Abstract

Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearing-impaired people worldwide. Sign language translation models that rely on video inputs have large parameter sizes, making them time-consuming and computationally intensive to deploy. This directly contributes to the scarcity of human-centered technology in this field. Additionally, the lack of datasets in sign language translation hampers research progress in this area. To address these issues, we first propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework. Compared to other pre-trained models, our framework achieves significant advancements in correcting text output errors. Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on the PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet. Additionally, the TensorFlow Lite (TFLite) quantized model size is reduced to 12.93 MB, making it the smallest, fastest, and most accurate model to date. We have also collected and released extensive Chinese sign language datasets and developed a specialized training vocabulary. To address the lack of research on data augmentation for landmark data, we designed comparative experiments on various augmentation methods. Moreover, we performed a simulated deployment and prediction of our model on Intel platform CPUs and assessed the feasibility of deploying the model on other platforms.
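For readers unfamiliar with the metric, the Word Error Rate cited above is conventionally the word-level edit distance (substitutions, deletions, insertions) between the predicted and reference sentences, divided by the reference length. The sketch below is a minimal illustration of that standard definition, not the authors' evaluation script; the function name `word_error_rate` and the gloss sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical gloss sequences: one substitution plus one deletion
# against a four-word reference.
print(word_error_rate("MORGEN REGEN NORD WIND", "MORGEN SONNE NORD"))
```

Under this definition, a lower WER means fewer word-level edits are needed to turn the model output into the reference.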

Paper Structure (23 sections, 6 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Publication time versus model performance. Larger circles indicate higher FLOPs, reflecting greater computational complexity.
  • Figure 2: An overview of our KD-MSLRT. We adopt the SOTA video-based sign language model CorrNet [12] as the teacher network. We conduct knowledge distillation separately on the probabilities obtained from the convolutional layers and bidirectional LSTM. The probabilities from the bidirectional LSTM are used to predict sentences, followed by further refinement of the predicted sentences using a text correction network trained on a large corpus.
  • Figure 3: MSLR, the lightweight MediaPipe-based sign language recognition model proposed in this research.
  • Figure 4: The self-supervised text correction model proposed in this research, designed around the error types found in the SLR model outputs.
  • Figure 5: The Chinese long-sentence sign language dataset collected for this release, which consists entirely of news content and includes a rich array of professional terms.
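The Figure 2 caption describes distillation applied separately to the probabilities from the convolutional layers and from the bidirectional LSTM. The sketch below shows one common way such a two-output objective could be written: a temperature-softened KL divergence between teacher and student distributions, summed over both outputs. The exact loss used in the paper is not given in this summary, so the KL formulation, the temperature, the weights `w_conv` and `w_lstm`, the function names, and the tensor shapes are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over the last axis with a temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    eps = 1e-12  # avoids log(0)
    kl = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)),
        axis=-1,
    )
    return float(kl.mean() * temperature ** 2)


def two_output_distillation_loss(t_conv, s_conv, t_lstm, s_lstm,
                                 w_conv=1.0, w_lstm=1.0, temperature=4.0):
    """Hypothetical weighted sum of the conv-level and BiLSTM-level terms."""
    return (w_conv * distillation_loss(t_conv, s_conv, temperature)
            + w_lstm * distillation_loss(t_lstm, s_lstm, temperature))


# Assumed shapes: (time steps, vocabulary size) per output.
rng = np.random.default_rng(0)
t_conv, s_conv = rng.normal(size=(20, 50)), rng.normal(size=(20, 50))
t_lstm, s_lstm = rng.normal(size=(20, 50)), rng.normal(size=(20, 50))
print(two_output_distillation_loss(t_conv, s_conv, t_lstm, s_lstm))
```

The loss is zero when the student reproduces the teacher's distributions exactly and grows as they diverge. In practice a term like this would typically be added to the recognition loss of the student network.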