
Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining

Neena Aloysius, Geetha M, Prema Nedungadi

TL;DR

This work tackles continuous sign language recognition (CSLR) by adapting the Conformer architecture, originally developed for speech recognition, to a vision-based task. The resulting model, ConSignformer, is a three-stream ensemble (RGB Encoder, Heatmap Encoder, Fusion Network) enhanced with Cross-Modal Relative Attention (CMRA) and Attentional Pyramid Networks (APN). An unsupervised pretraining regime, called Regressional Feature Extraction, pretrains the Conformer on sign-language pose data to provide informative priors for downstream CSLR. The approach achieves state-of-the-art Word Error Rate (WER) on the German PHOENIX-2014 and PHOENIX-2014T datasets, which is attributed to improved context learning and cross-modal fusion. Despite these gains, the method incurs high computational cost, motivating future research toward efficient, production-ready models that maintain accuracy while enabling broader accessibility for sign-language technologies.
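Since the results are reported as Word Error Rate, a minimal sketch of the standard WER definition may help: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example gloss strings are hypothetical illustrations, not taken from the paper or from the PHOENIX evaluation scripts, which may apply additional normalisation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical gloss sequences: one deleted gloss out of four reference glosses.
print(word_error_rate("A B C D", "A C D"))
```

Lower is better, and because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.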

Abstract

Conventional deep learning frameworks for continuous sign language recognition (CSLR) comprise a single- or multi-modal feature extractor, a sequence-learning module, and a decoder that outputs the glosses. The sequence-learning module is a crucial component, and transformers have demonstrated their efficacy in such sequence-to-sequence tasks. Natural Language Processing and Speech Recognition have seen a rapid introduction of transformer variants. In sign language recognition, however, experimentation with the sequence-learning component has been limited. In this work, the state-of-the-art Conformer model for Speech Recognition is adapted for CSLR, and the proposed model is termed ConSignformer. This marks the first instance of employing Conformer for a vision-based task. ConSignformer has a bimodal pipeline with a CNN as the feature extractor and a Conformer for sequence learning. For improved context learning, we also introduce Cross-Modal Relative Attention (CMRA). Incorporating CMRA makes the model more adept at learning and utilizing complex relationships within the data. To further enhance the Conformer model, an unsupervised pretraining scheme called Regressional Feature Extraction is conducted on a curated sign language dataset. The pretrained Conformer is then fine-tuned for the downstream recognition task. The experimental results confirm the effectiveness of the adopted pretraining strategy and demonstrate how CMRA contributes to the recognition process. Leveraging a Conformer-based backbone, our model achieves state-of-the-art performance on the benchmark datasets PHOENIX-2014 and PHOENIX-2014T.

Paper Structure

This paper contains 20 sections, 6 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Joint estimates of (a) MediaPipe Pose, with the Euclidean distances between joints shown as blue lines, and (b) MediaPipe Hands. [Adapted from MediaPipe]
  • Figure 2: Unsupervised Pretraining - Regressional Feature Extraction
  • Figure 3: RGB Encoder (when the input is RGB videos) and Heatmap Encoder (when the input is heatmap videos). Heatmaps are representations of the keypoints of the face, hands, and upper body, extracted using HRNet [jin2020whole] trained on COCO-WholeBody [wang2020deep].
  • Figure 4: Attentional Pyramid Network. $\text{Feature maps}_i$ denotes the intermediate representations of S3D. The left side shows the Top-Down flow and the right side shows the Bottom-Up flow; lateral connections between them result in a Parallel Flow of features. Attention weights are denoted by $\alpha_i$ and $\beta_i$.
  • Figure 5: ConSignformer - Ensemble of RGB Encoder, Heatmap Encoder and Fusion Network, supplemented by Attentional Pyramid Networks.
  • ...and 2 more figures
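
The Figure 1 caption mentions Euclidean distances between estimated joints. A minimal sketch of that computation follows, assuming each joint is available as a 2D coordinate. The joint names, coordinates, and function name are hypothetical, and this excerpt does not specify which joint pairs the paper actually uses.

```python
import math
from itertools import combinations


def pairwise_joint_distances(joints):
    """Return the Euclidean distance for every unordered pair of joints.

    `joints` maps a joint name to its (x, y) coordinate; names are sorted
    so that each pair appears once, in a deterministic order.
    """
    return {
        (a, b): math.dist(joints[a], joints[b])
        for a, b in combinations(sorted(joints), 2)
    }


# Hypothetical coordinates for three upper-body joints.
example_joints = {
    "left_shoulder": (0.0, 0.0),
    "right_shoulder": (3.0, 4.0),
    "nose": (0.0, 1.0),
}
distances = pairwise_joint_distances(example_joints)
print(distances[("left_shoulder", "right_shoulder")])
```

In practice the coordinates would come from a pose estimator such as MediaPipe, and distances are often normalised (for example by shoulder width) to reduce sensitivity to signer scale; that normalisation is an assumption here, not something stated in the excerpt.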