SignCLIP: Connecting Text and Sign Language by Contrastive Learning
Zifan Jiang, Gerard Sant, Amit Moryossef, Mathias Müller, Rico Sennrich, Sarah Ebling
TL;DR
SignCLIP adapts CLIP-style contrastive learning to connect spoken language text with sign language videos, projecting both into a shared multimodal embedding space. Starting from FingerCLIP as a proof of concept, the authors pretrain SignCLIP on Spreadthesign, a large multilingual sign language dictionary, to learn a shared representation without task-specific supervision. The approach yields strong in-domain text-video retrieval and competitive isolated sign language recognition after few-shot prompting or fine-tuning, along with qualitative insights into latent space alignment and sign iconicity across languages. Although out-of-domain zero-shot transfer remains challenging, the work demonstrates the viability of large-scale, multilingual sign language pretraining and points toward more generalizable sign language processing through cross-lingual transfer and further scaling.
Abstract
We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language, for which data is often limited. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it on various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively on out-of-domain downstream tasks such as isolated sign language recognition, given the necessary few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
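To make the idea of projecting text and signing into the same space concrete, the following is a minimal sketch of a generic CLIP-style symmetric contrastive objective and the retrieval step it enables. It is not the SignCLIP implementation: the function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and the encoders that would produce real text and video (pose) embeddings are omitted entirely.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(text_embs, video_embs, temperature=0.07):
    """Cosine similarity of every text with every video, divided by the temperature.

    The temperature value is an assumption for this sketch.
    """
    t = [l2_normalize(v) for v in text_embs]
    s = [l2_normalize(v) for v in video_embs]
    return [[sum(a * b for a, b in zip(ti, sj)) / temperature for sj in s] for ti in t]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct match for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(text_embs, video_embs, temperature=0.07):
    """Average of the text-to-video and video-to-text cross-entropy losses."""
    logits = similarity_matrix(text_embs, video_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def retrieve(logits):
    """Text-to-video retrieval: index of the best-scoring video for each text."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Hypothetical toy embeddings: text i is paired with video i.
text_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
video_embs = [[0.9, 0.1], [0.1, 0.9], [-0.8, 0.2]]

aligned_loss = symmetric_contrastive_loss(text_embs, video_embs)
shuffled_loss = symmetric_contrastive_loss(text_embs, video_embs[::-1])
top1 = retrieve(similarity_matrix(text_embs, video_embs))
```

In this toy setup the correctly paired batch gives a lower loss than the batch with the videos in reversed order, and each text retrieves its own video. Video-to-text retrieval would apply the same `retrieve` step to the transposed similarity matrix.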
