SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning
Hao Chen, Jiaze Wang, Ziyu Guo, Jinpeng Li, Donghao Zhou, Bian Wu, Chenyong Guan, Guangyong Chen, Pheng-Ann Heng
TL;DR
This work addresses data scarcity and weak supervision in continuous sign language recognition by introducing SignVTCL, a multi-modal framework that fuses video, keypoints, and optical flow while leveraging visual-textual contrastive learning. A frozen text encoder (mBART) and lightweight adapters (V2T and T2V) align visual and textual features at the gloss and sentence levels, guided by DTW-based gloss mapping and global sentence-level alignment. The model is trained without pre-training, using a combination of CTC losses, auxiliary SPN losses, and alignment losses, and achieves state-of-the-art results on Phoenix-2014, Phoenix-2014T, and CSL-Daily, with ablations supporting each component. The method demonstrates strong cross-modal generalization and offers a viable path toward stronger, language-informed visual representations for sign language and broader multi-modal video understanding tasks.
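To make the DTW-based gloss mapping concrete, the following is a minimal sketch of classic dynamic time warping between a sequence of per-frame visual features and a sequence of gloss embeddings. It is an illustration, not the authors' implementation: the function names (`cosine_distance`, `dtw_align`), the cosine cost, and the unconstrained three-move step pattern are all assumptions, and the paper's exact formulation may differ.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two vectors (assumed cost; hypothetical choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-8)


def dtw_align(visual, gloss):
    """Classic DTW: return (path, total_cost).

    `visual` is a list of T frame-level feature vectors and `gloss` a list of
    G gloss embeddings. `path` is a monotonic list of (frame_idx, gloss_idx).
    """
    T, G = len(visual), len(gloss)
    inf = float("inf")
    # acc[i][j]: minimal accumulated cost aligning the first i frames
    # with the first j glosses.
    acc = [[inf] * (G + 1) for _ in range(T + 1)]
    acc[0][0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, G + 1):
            cost = cosine_distance(visual[i - 1], gloss[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])

    # Backtrack from the final cell to recover the alignment path.
    path = []
    i, j = T, G
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path, acc[T][G]


# Toy example: four frames, two glosses; the first two frames resemble gloss 0.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
glosses = [[1.0, 0.0], [0.0, 1.0]]
path, total = dtw_align(frames, glosses)
```

In this toy case the recovered path assigns the first two frames to the first gloss and the last two frames to the second, which is the kind of frame-to-gloss mapping a gloss-level alignment loss could then be computed over.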
Abstract
Sign language recognition (SLR) plays a vital role in facilitating communication for the hearing-impaired community. SLR is a weakly supervised task in which entire videos are annotated with gloss sequences, which makes it challenging to identify the gloss corresponding to a given video segment. Recent studies indicate that the main bottleneck in SLR is insufficient training caused by the limited availability of large-scale datasets. To address this challenge, we present SignVTCL, a multi-modal continuous sign language recognition framework enhanced by visual-textual contrastive learning, which leverages the full potential of multi-modal data and the generalization ability of language models. SignVTCL integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone, thereby yielding more robust visual representations. Furthermore, SignVTCL contains a visual-textual alignment approach that incorporates gloss-level and sentence-level alignment to ensure precise correspondence between visual features and glosses at both the individual-gloss level and the sentence level. Experiments on three datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL achieves state-of-the-art results compared with previous methods.
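As an illustration of what sentence-level visual-textual contrastive alignment can look like, the sketch below computes a symmetric, CLIP-style InfoNCE loss over a batch of pooled visual embeddings and their paired sentence embeddings. This is an assumed, commonly used formulation rather than the loss defined in the paper; the function names, the temperature value of 0.07, and the use of cosine similarity are all hypothetical choices.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec)) + 1e-8
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE loss; visual[i] and text[i] form the positive pair.

    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    n = len(v)
    # logits[i][j]: scaled cosine similarity between video i and sentence j.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Visual-to-text direction: each video should pick out its own sentence.
    loss_vis_to_txt = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-visual direction: each sentence should pick out its own video.
    logits_t = [list(col) for col in zip(*logits)]
    loss_txt_to_vis = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_vis_to_txt + loss_txt_to_vis)


# Toy batch: correctly paired embeddings vs. deliberately swapped pairs.
matched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

With correctly paired embeddings the loss is close to zero, while swapping the pairs makes it large, which is the signal that pulls matching visual and textual representations together during training.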
