SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning

Hao Chen, Jiaze Wang, Ziyu Guo, Jinpeng Li, Donghao Zhou, Bian Wu, Chenyong Guan, Guangyong Chen, Pheng-Ann Heng

TL;DR

This work addresses data scarcity and weak supervision in continuous sign language recognition by introducing SignVTCL, a multi-modal framework that fuses video, keypoints, and optical flow while leveraging visual-textual contrastive learning. A frozen text encoder (mBART) and two lightweight adapters (V2T and T2V) align visual and textual features at the gloss and sentence levels, guided by DTW-based gloss mapping and global sentence-level alignment. The approach is trained with a combination of CTC losses, auxiliary SPN losses, and alignment losses, without pre-training, and achieves state-of-the-art results on Phoenix-2014, Phoenix-2014T, and CSL-Daily, with ablations supporting each component. The method demonstrates strong cross-modal generalization and offers a viable path toward stronger, language-informed visual representations for sign language and broader multi-modal video understanding tasks.

Abstract

Sign language recognition (SLR) plays a vital role in facilitating communication for the hearing-impaired community. SLR is a weakly supervised task in which entire videos are annotated with glosses, making it challenging to identify the gloss that corresponds to a given video segment. Recent studies indicate that the main bottleneck in SLR is insufficient training caused by the limited availability of large-scale datasets. To address this challenge, we present SignVTCL, a multi-modal continuous sign language recognition framework enhanced by visual-textual contrastive learning, which leverages the full potential of multi-modal data and the generalization ability of language models. SignVTCL integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone, thereby yielding more robust visual representations. Furthermore, SignVTCL contains a visual-textual alignment approach that incorporates gloss-level and sentence-level alignment to ensure precise correspondence between visual features and glosses, both for individual glosses and for whole sentences. Experimental results on three datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL achieves state-of-the-art results compared with previous methods.

Paper Structure (15 sections, 11 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: The Overview of SignVTCL. Three modalities of data are used to learn visual representations of sign language. The alignment at both the gloss and sentence level facilitates language-guided visual representation learning for boosting SLR capability.
  • Figure 2: The Pipeline of SignVTCL. The multi-modal visual backbone aims to extract visual representations from three different modalities. These features are then passed through head networks for predicting frame-wise gloss probabilities. Simultaneously, we input labeled glosses into a frozen pretrained text encoder to obtain textual representations. Subsequently, the visual and textual representations are aligned within a joint multi-modal semantic space to supervise the multi-modal visual backbone using two adapters: the V2T adapter and the T2V adapter. During the inference phase, a CTC decoder is employed to generate glosses based on the predicted gloss probabilities.
  • Figure 3: The Architecture of Three-Branch Network. In each branch, the first four blocks of S3D serve as the backbone, providing the foundational architecture. Between each pair of consecutive blocks, a multi-modal fusion module is incorporated to effectively merge information from different modalities.
  • Figure 4: The Architecture of Sign Pyramid Network (SPN). To ensure the generation of meaningful representations, each branch is equipped with an SPN to supervise shallow layers. Transposed convolutions are employed to align the temporal and spatial dimensions of two feature maps, enabling element-wise addition between them.
  • Figure 5: An Example of Finding the Alignment of Gloss and Visual Features. Assuming a vocabulary size of C glosses, each gloss can be represented by an 'ID' starting from 1. It is important to note that each row in the probability matrix should add up to 1. The guiding principle for determining the path with the highest probability is as follows: commencing from the top-left corner of the matrix, movement is restricted to either downward or towards the lower-right position. This constraint is imposed to maintain the correspondence between the order of video frames and labeled glosses.
  • ...and 3 more figures
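
The path-finding rule in the Figure 5 caption can be illustrated with a small dynamic program. The sketch below is not the authors' implementation; it is a minimal illustration under stated assumptions: `probs` is a hypothetical list of per-frame gloss probabilities (one row per frame, each row summing to 1), `labels` holds the labeled glosses as 0-based column indices (the caption numbers IDs from 1), the path must end in the bottom-right cell, and log-probabilities are summed instead of multiplying raw probabilities. The function name `align_glosses` is also made up for this example.

```python
import math


def align_glosses(probs, labels):
    """Assign every frame to one labeled gloss along the most probable path.

    probs:  T rows, each holding C gloss probabilities for one frame.
    labels: N gloss column indices (0-based here) in signing order, N <= T.
    Returns a list of length T; entry t is the position in `labels`
    that frame t is aligned to.
    """
    T, N = len(probs), len(labels)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per labeled gloss")

    NEG = float("-inf")

    def logp(t, n):
        p = probs[t][labels[n]]
        return math.log(p) if p > 0 else NEG

    # score[t][n]: best log-probability of a path from (0, 0) to (t, n).
    score = [[NEG] * N for _ in range(T)]
    # came_from_diag[t][n]: True if the best path reached (t, n) by a
    # lower-right move, i.e. frame t is the first frame of gloss n.
    came_from_diag = [[False] * N for _ in range(T)]

    score[0][0] = logp(0, 0)  # the path starts in the top-left corner
    for t in range(1, T):
        for n in range(min(t + 1, N)):  # gloss n needs at least n earlier frames
            stay = score[t - 1][n]                         # move downward
            advance = score[t - 1][n - 1] if n > 0 else NEG  # move lower-right
            if advance > stay:
                came_from_diag[t][n] = True
                score[t][n] = advance + logp(t, n)
            else:
                score[t][n] = stay + logp(t, n)

    # Backtrack from the bottom-right corner (assumed end point).
    path = [0] * T
    n = N - 1
    for t in range(T - 1, -1, -1):
        path[t] = n
        if came_from_diag[t][n]:
            n -= 1
    return path


# Toy example: 5 frames, vocabulary of 3 glosses, labeled sequence [2, 0].
probs = [
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
    [0.6, 0.1, 0.3],
    [0.7, 0.2, 0.1],
    [0.8, 0.1, 0.1],
]
print(align_glosses(probs, [2, 0]))  # [0, 0, 1, 1, 1]
```

In the toy example the first two frames are assigned to the first labeled gloss and the remaining three to the second. Because only downward and lower-right moves are allowed, the assignment can never reorder glosses relative to the video frames, which is the constraint the caption describes.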