Denoising-Contrastive Alignment for Continuous Sign Language Recognition

Leming Guo, Wanli Xue, Shengyong Chen

TL;DR

This work tackles continuous sign language recognition (CSLR) by introducing Denoising-Contrastive Alignment (DCA), which leverages textual grammar to guide video representations. It combines Contrastive Instance Alignment (CIA) for sign-gloss correspondence with a Denoising-Diffusion Alignment (DDA) that aligns global gloss context through a diffusion-based autoencoder, aided by a gloss encoder informed by a fine-tuned language model. By integrating a mutual projection between video and gloss features and a gradient-modulated optimization, DCA achieves state-of-the-art results on PHOENIX-2014, PHOENIX-2014T, and CSL-Daily, while enhancing global temporal context learning. The approach demonstrates the practical impact of textual-grammar guidance for refining visual representations in CSLR, albeit with higher training-time overhead that invites further efficiency work.

Abstract

Continuous sign language recognition (CSLR) aims to transcribe the signs in untrimmed sign language videos into textual glosses. A key challenge of CSLR is achieving effective cross-modality alignment between video and gloss sequences to enhance video representation. However, current cross-modality alignment paradigms often neglect the role of textual grammar in guiding the video representation to learn global temporal context, which adversely affects recognition performance. To tackle this limitation, we propose a Denoising-Contrastive Alignment (DCA) paradigm. DCA leverages textual grammar to enhance video representations through two complementary approaches: modeling the instance correspondence between signs and glosses from a discriminative perspective, and aligning their global context from a generative perspective. Specifically, DCA establishes flexible instance-level correspondence between signs and glosses using a contrastive loss. Building on this, DCA models global context alignment between the video and gloss sequences by denoising the gloss representation from noise, guided by the video representation. Additionally, DCA introduces gradient modulation to balance the alignment and recognition gradients so that the two objectives do not conflict during optimization. By integrating gloss-wise and global context knowledge, DCA significantly enhances video representations for CSLR tasks. Experimental results across public benchmarks validate the effectiveness of DCA and confirm that it can enhance video representations.
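To make the instance-level contrastive idea from the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a generic InfoNCE-style loss in which each video clip is scored against its most similar gloss, and matched video-gloss pairs in a batch are contrasted against mismatched ones. The function names (`cosine`, `pair_similarity`, `contrastive_loss`), the max-over-glosses matching rule, and the `temperature` value are all illustrative assumptions; only the video-to-gloss direction is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def pair_similarity(clips, glosses):
    """Assumed matching rule: each clip picks its best-matching gloss,
    and the per-clip best scores are averaged over the video."""
    return sum(max(cosine(c, g) for g in glosses) for c in clips) / len(clips)

def contrastive_loss(videos, gloss_seqs, temperature=0.1):
    """InfoNCE over a batch: videos[i] is the positive for gloss_seqs[i],
    every other gloss sequence in the batch acts as a negative."""
    n = len(videos)
    total = 0.0
    for i in range(n):
        logits = [pair_similarity(videos[i], gloss_seqs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n

# Toy batch: two videos (two clips each) and two one-gloss sequences.
videos = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
glosses = [[[1.0, 0.0]], [[0.0, 1.0]]]
aligned = contrastive_loss(videos, glosses)
swapped = contrastive_loss(videos, glosses[::-1])
```

In this toy batch the correctly paired inputs give a much smaller loss than the swapped pairing, which is the behavior a contrastive alignment objective is meant to reward.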

Paper Structure (19 sections, 10 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Cross-modality alignment paradigms investigated in CSLR. (a) Each video clip is mapped to an individual gloss in the gloss space. (b) The distributions of the two modalities are drawn close in a high-dimensional common latent space. (c) Each video clip and the glosses of previous time steps are mapped in multi-hybrid spaces. (d) Clip-wise features of the two modalities are mutually mapped and projected into a low-dimensional common space.
  • Figure 2: Illustration of the proposed Denoising-Contrastive Alignment (DCA). We begin by using Contrastive Instance Alignment to model the sign-gloss instance correspondence and learn sign semantics. This approach encourages each sign to match its most semantically relevant gloss. Next, we propose Denoising-Diffusion Alignment to align the global context between video and gloss sequences. This technique guides video sequence representations to reconstruct gloss sequence representations from noise. The learned instance and global gloss semantics then supervise the video encoder, helping refine the global temporal context of video representations. Finally, we introduce gradient modulation to adjust the optimization angle between the alignment gradient and the recognition gradient to avoid optimization conflict.
  • Figure 3: Evaluation of global temporal context learning for DCA and other SOTA CSLR methods on the PHOENIX-2014 test set.
  • Figure 4: Recognition results and Grad-CAM [gradcam2017] visualizations of TLP [hu2022temporal], SEN [Hu2022SelfEmphasizingNF], CorrNet [hu2023continuous], and the proposed DCA on a PHOENIX-2014 test video. Glosses marked with red symbols are wrongly predicted. The color of each region (blue, yellow, red, dark red) indicates weak to strong model attention to that spatial region of the sign.
  • Figure 5: Evaluation of DCA's generalization and recognition accuracy compared with other SOTA CSLR methods on the PHOENIX-2014 test set.
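
The Figure 2 caption describes gradient modulation as adjusting the angle between the alignment gradient and the recognition gradient to avoid optimization conflict. The source does not give the exact rule, so the sketch below substitutes a well-known PCGrad-style projection as a stand-in: when the two gradients point in conflicting directions (negative dot product), the component of the alignment gradient that opposes the recognition gradient is removed. The function names `dot` and `modulate` are hypothetical, and this should be read as one plausible illustration rather than the paper's method.

```python
def dot(u, v):
    """Inner product of two gradient vectors (plain lists)."""
    return sum(a * b for a, b in zip(u, v))

def modulate(align_grad, rec_grad):
    """PCGrad-style stand-in (an assumption, not the paper's rule):
    if the alignment gradient conflicts with the recognition gradient,
    project it onto the plane orthogonal to the recognition gradient."""
    d = dot(align_grad, rec_grad)
    if d >= 0:
        # No conflict: the angle is at most 90 degrees, keep the gradient.
        return list(align_grad)
    scale = d / (dot(rec_grad, rec_grad) + 1e-12)
    return [a - scale * r for a, r in zip(align_grad, rec_grad)]

# Conflicting case: the alignment gradient partly opposes recognition.
rec = [1.0, 0.0]
conflicting = modulate([-1.0, 1.0], rec)
# Non-conflicting case: the alignment gradient is returned unchanged.
compatible = modulate([1.0, 1.0], rec)
```

After projection, the conflicting alignment gradient is orthogonal to the recognition gradient, so a step along it no longer works against the recognition objective to first order.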