Denoising-Contrastive Alignment for Continuous Sign Language Recognition
Leming Guo, Wanli Xue, Shengyong Chen
TL;DR
This work tackles continuous sign language recognition (CSLR) by introducing Denoising-Contrastive Alignment (DCA), which leverages textual grammar to guide video representations. It combines Contrastive Instance Alignment (CIA) for sign-gloss correspondence with a Denoising-Diffusion Alignment (DDA) that aligns global gloss context through a diffusion-based autoencoder, aided by a gloss encoder informed by a fine-tuned language model. By integrating a mutual projection between video and gloss features and a gradient-modulated optimization, DCA achieves state-of-the-art results on PHOENIX-2014, PHOENIX-2014T, and CSL-Daily, while enhancing global temporal context learning. The approach demonstrates the practical impact of textual-grammar guidance for refining visual representations in CSLR, albeit with higher training-time overhead that invites further efficiency work.
Abstract
Continuous sign language recognition (CSLR) aims to transcribe the signs in untrimmed sign language videos into textual glosses. A key challenge in CSLR is achieving effective cross-modality alignment between video and gloss sequences to enhance the video representation. However, current cross-modality alignment paradigms often neglect the role of textual grammar in guiding the video representation to learn global temporal context, which adversely affects recognition performance. To address this limitation, we propose a Denoising-Contrastive Alignment (DCA) paradigm. DCA leverages textual grammar to enhance video representations through two complementary approaches: modeling the instance correspondence between signs and glosses from a discriminative perspective, and aligning their global context from a generative perspective. Specifically, DCA establishes flexible instance-level correspondence between signs and glosses using a contrastive loss. Building on this, DCA models global context alignment between the video and gloss sequences by denoising the gloss representation from noise, guided by the video representation. Additionally, DCA introduces gradient modulation to optimize the alignment and recognition gradients, ensuring a more effective learning process. By integrating gloss-wise and global context knowledge, DCA significantly enhances video representations for CSLR. Experimental results on public benchmarks validate the effectiveness of DCA and confirm that it enhances video representations.
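To make the instance-level contrastive idea concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss between sign features and gloss features. It is not the paper's CIA formulation: it assumes that sign segments and glosses are already paired one-to-one by index, whereas the abstract describes a flexible correspondence. The function names, the cosine similarity, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_alignment_loss(sign_feats, gloss_feats, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    Assumption: sign_feats[i] and gloss_feats[i] form the positive pair;
    every other combination is treated as a negative.
    """
    n = len(sign_feats)
    if n == 0 or n != len(gloss_feats):
        raise ValueError("need the same non-zero number of signs and glosses")

    # Temperature-scaled similarity matrix: rows are signs, columns are glosses.
    sim = [[cosine(s, g) / temperature for g in gloss_feats] for s in sign_feats]

    # Sign-to-gloss direction: each sign should pick out its own gloss.
    sign_to_gloss = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Gloss-to-sign direction: same computation on the transposed matrix.
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    gloss_to_sign = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (sign_to_gloss + gloss_to_sign)
```

With orthogonal toy features, matching pairs give a loss near zero, while permuting the glosses so that no pair matches gives a much larger loss. That gap is the signal a contrastive objective uses to pull corresponding sign and gloss representations together.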
