Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

Ronglai Zuo, Brian Mak

TL;DR

This work tackles CSLR under limited data by introducing three auxiliary tasks that enrich backbone representations: spatial attention consistency (SAC) guides the visual module to attend to informative facial and hand regions using keypoint heatmaps; sentence embedding consistency (SEC) enforces sentence-level consistency between visual and sequential features via a lightweight sentence embedding extractor (SEE) and negative sampling; and a signer removal module (SRM) uses statistics pooling and gradient reversal to remove signer-specific information for signer-independent CSLR. The approach is implemented in an end-to-end transformer-based backbone (local Transformer) with a CTC-based alignment module, achieving state-of-the-art or competitive results on five benchmarks (PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, CSL-Daily). Ablation studies validate the complementary benefits of SAC and SEC, and demonstrate the effectiveness of SRM in reducing signer dependency, especially for unseen signers. The results suggest practical potential for signer-independent CSLR using RGB input at inference, with heatmap-guided attention and sentence-level consistency between visual and sequential features supporting recognition.
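
As a rough illustration of the statistics pooling mentioned above, the sketch below collapses a variable-length sequence of frame-level feature vectors into one fixed-size vector by concatenating the per-dimension mean and standard deviation over time. The function name `statistics_pooling`, the choice of population standard deviation, and the toy input are assumptions made for illustration, not details taken from the paper.

```python
import math


def statistics_pooling(frames):
    """Concatenate the per-dimension mean and std over time.

    frames: list of T feature vectors, each a list of D floats.
    Returns a list of 2 * D floats: [means..., stds...].
    (Hypothetical helper; the paper's exact pooling may differ.)
    """
    if not frames:
        raise ValueError("need at least one frame")
    t = len(frames)
    d = len(frames[0])
    means = [sum(f[i] for f in frames) / t for i in range(d)]
    # Population standard deviation (an assumption; other variants exist).
    stds = [
        math.sqrt(sum((f[i] - means[i]) ** 2 for f in frames) / t)
        for i in range(d)
    ]
    return means + stds


# Two frames with two feature dimensions each.
pooled = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
```

Because the output length depends only on the feature dimension, a signer classifier placed after this pooling can take videos of any length.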

Abstract

Most deep-learning-based continuous sign language recognition (CSLR) models share a similar backbone consisting of a visual module, a sequential module, and an alignment module. However, due to limited training samples, a connectionist temporal classification loss may not train such CSLR backbones sufficiently. In this work, we propose three auxiliary tasks to enhance the CSLR backbones. The first task enhances the visual module, which is sensitive to the insufficient training problem, from the perspective of consistency. Specifically, since the information of sign languages is mainly conveyed by signers' facial expressions and hand movements, a keypoint-guided spatial attention module is developed to force the visual module to focus on informative regions, i.e., spatial attention consistency. Second, noticing that the output features of both the visual and sequential modules represent the same sentence, a sentence embedding consistency constraint is imposed between the visual and sequential modules to better exploit the backbone's power and enhance the representation power of both features. We name the CSLR model trained with the above auxiliary tasks consistency-enhanced CSLR, which performs well on signer-dependent datasets in which all signers appear during both training and testing. To make it more robust for the signer-independent setting, a signer removal module based on feature disentanglement is further proposed to remove signer information from the backbone. Extensive ablation studies are conducted to validate the effectiveness of these auxiliary tasks. More remarkably, with a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks: PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.
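
One way to picture a sentence embedding consistency constraint is a triplet-style margin loss: the visual and sequential embeddings of the same sentence should be closer to each other than the visual embedding is to an embedding of a different sentence. The sketch below is only that picture under stated assumptions. The cosine distance, the margin value, and the names `cosine_distance` and `consistency_loss` are illustrative choices, not the formulation confirmed by the abstract.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def consistency_loss(visual, sequential, negative, margin=1.0):
    """Triplet-style margin loss (assumed form, for illustration only).

    visual, sequential: sentence embeddings of the same video.
    negative: sentence embedding taken from a different video.
    """
    pos = cosine_distance(visual, sequential)
    neg = cosine_distance(visual, negative)
    return max(0.0, pos - neg + margin)


# The matching pair is identical and the negative is orthogonal: no penalty.
loss_easy = consistency_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# The negative is as close as the positive: the full margin is paid.
loss_hard = consistency_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
```

In this toy form the loss is zero once the negative is at least `margin` farther away than the positive, so only confusable sentence pairs contribute.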

Paper Structure (55 sections, 24 equations, 9 figures, 15 tables)

This paper contains 55 sections, 24 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: An overview of the CSLR backbone and the three proposed auxiliary tasks. First, our SAC forces the visual module to focus on informative regions by leveraging pose keypoint heatmaps. Second, our SEC aligns the visual and sequential features at the sentence level, which enhances the representation power of both features simultaneously. SAC and SEC constitute our preliminary work \cite{zuo2022c2slr}, consistency-enhanced CSLR ($\text{C}^2$SLR). In this work, we extend $\text{C}^2$SLR by developing a novel signer removal module based on feature disentanglement for signer-independent CSLR.
  • Figure 2: An overview of our proposed method. The sign video input is first fed into the visual module (e.g., VGGNet \cite{vggnet} and ResNet \cite{resnet}) to extract visual features. The following sequential module (e.g., local Transformer (see details in Section \ref{sec:lt}) and TCN) further models long-/short-term dependencies and yields sequential features. The CTC loss \cite{ctc} is adopted as the main objective function. Three auxiliary tasks (highlighted in different colors) are proposed to improve the performance of the CSLR backbone. For spatial attention consistency, we insert a keypoint-guided spatial attention module after the $m$-th convolution layer, $C_m$, of the visual module. Besides, we push the model to align visual and sequential features at the sentence level to enhance their representation power. Finally, we introduce a signer removal module to make the model more robust to signer discrepancy under the signer-independent setting.
  • Figure 3: (a) The architecture of our spatial attention module ($J\times K\times C$: the size of the input feature maps, GAP: global average pooling, CMP: channel-wise max pooling). (b) Two examples of the original and refined heatmaps.
  • Figure 4: The workflow of sentence embedding extraction. We omit LayerNorm \cite{layernorm} for simplicity.
  • Figure 5: Workflow of our signer removal module (SRM). We insert the SRM after the $m$-th CNN layer, $C_m$. The loss of $\text{C}^2$SLR, $\mathcal{L}_b$, which is a sum of the CTC, SAC, and SEC losses, is used to train the backbone parameters $\theta_b$. The signer classification loss $\mathcal{L}_{srm}$ is used to train the SRM parameters $\theta_s$ as usual, while the gradient from $\mathcal{L}_{srm}$ will be reversed for $\theta_b$. $\lambda$ is the loss weight for $\mathcal{L}_{srm}$.
  • ...and 4 more figures
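
The gradient flow described in the Figure 5 caption can be sketched without any deep-learning framework. A gradient reversal layer is the identity in the forward pass and negates the gradient in the backward pass, so the backbone is pushed to make signer classification harder while the SRM itself is trained as usual. In the sketch below, the class `GradientReversal`, the helper `backbone_gradient`, and the toy gradient values are hypothetical, and applying $\lambda$ to the reversed gradient simply follows from treating it as the weight of $\mathcal{L}_{srm}$.

```python
class GradientReversal:
    """Identity in the forward pass; negates gradients in the backward pass."""

    def forward(self, features):
        # Features pass through unchanged.
        return list(features)

    def backward(self, grad_output):
        # The sign flip is the whole trick: downstream gradients are reversed.
        return [-g for g in grad_output]


def backbone_gradient(grad_lb, grad_lsrm, lam):
    """Gradient reaching the backbone parameters theta_b (illustrative only).

    grad_lb:   gradient of the backbone loss L_b.
    grad_lsrm: gradient of the signer classification loss L_srm.
    lam:       loss weight lambda for L_srm.
    """
    reversed_grad = GradientReversal().backward(grad_lsrm)
    return [gb + lam * gr for gb, gr in zip(grad_lb, reversed_grad)]


# Toy gradients: the second list is what the signer classifier would send back.
grad_theta_b = backbone_gradient([1.0, -2.0], [2.0, 4.0], lam=0.5)
```

With these toy numbers the backbone receives `[1.0 - 0.5 * 2.0, -2.0 - 0.5 * 4.0]`, that is, the $\mathcal{L}_b$ gradient minus the weighted signer-classification gradient.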