
SignCLIP: Connecting Text and Sign Language by Contrastive Learning

Zifan Jiang, Gerard Sant, Amit Moryossef, Mathias Müller, Rico Sennrich, Sarah Ebling

TL;DR

SignCLIP adapts CLIP-style contrastive learning to project spoken language text and sign language videos into a single shared embedding space. Starting from FingerCLIP as a proof of concept, the authors pretrain SignCLIP on Spreadthesign, a large multilingual sign language dictionary, to learn a shared representation without task-specific supervision. The approach yields strong in-domain text-video retrieval and competitive isolated sign language recognition after few-shot prompting or fine-tuning, along with qualitative insights into latent space alignment and sign iconicity across languages. Out-of-domain zero-shot transfer remains challenging, but the work demonstrates the viability of large-scale, multilingual sign language pretraining and points toward more generalizable sign language processing through cross-lingual transfer and future scaling.

Abstract

We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or a single sign language, for which data is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively on out-of-domain downstream tasks such as isolated sign language recognition, given the necessary few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
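
The contrastive objective and the retrieval evaluation mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`clip_style_loss`, `recall_at_k`), the temperature of 0.07, and the random toy embeddings are assumptions for illustration, and the real model obtains its embeddings from trained text and video encoders rather than from random vectors.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched (text, video) pairs.

    Row i of text_emb is assumed to describe row i of video_emb, so the
    correct matches sit on the diagonal of the similarity matrix.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(video_emb)
    logits = t @ v.T / temperature                     # shape (N, N)
    idx = np.arange(len(logits))
    loss_t2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> video
    loss_v2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # video -> text
    return 0.5 * (loss_t2v + loss_v2t)

def recall_at_k(text_emb, video_emb, k=1):
    """Fraction of texts whose matching video is among the k most similar videos."""
    sims = l2_normalize(text_emb) @ l2_normalize(video_emb).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data (assumption): "aligned" text embeddings are noisy copies of the videos.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
aligned_text = video + 0.01 * rng.normal(size=(8, 16))
random_text = rng.normal(size=(8, 16))

print("loss (aligned):", clip_style_loss(aligned_text, video))
print("loss (random): ", clip_style_loss(random_text, video))
print("text-to-video recall@1 (aligned):", recall_at_k(aligned_text, video, k=1))
```

In this toy setup the aligned pairs give a much lower loss than the random ones, which is the signal that training would push the two encoders toward. Video-to-text retrieval can be scored with the same helper by swapping the two arguments.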

Paper Structure (31 sections, 2 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Illustration of SignCLIP, comprising a text encoder and a video encoder jointly trained on pairs of text and multilingual signing examples. Every sign is articulated in diverse languages and contexts with subtle differences in hand shape, movement, place of articulation, etc. The screenshots of the videos are from Spreadthesign and the matrix part is taken from CLIP.
  • Figure 2: King – Man + Woman = Queen analogy revisited. 14 video examples of each sign are randomly sampled from the ASL Citizen dataset, embedded by a fine-tuned SignCLIP pose encoder, and then visualized by t-SNE (perplexity=15) with different shapes and colors. Each cluster center is marked with a larger symbol.
  • Figure 3: Screenshot of prompting ChatGPT 4o to sign "house" in ASL, tested in June 2024. The model lacks sign language knowledge and instead tries to sketch a picture of a house on the open palm.
  • Figure 4: Examples of the German finger-alphabet taken from the RWTH gesture database, recorded with a webcam, showing the letters A-Z, Ä, Ö, Ü, SCH, and the numbers 1 to 5. Note that J, Z, Ä, Ö, and Ü are dynamic gestures. Figure taken from https://www-i6.informatik.rwth-aachen.de/aslr/fingerspelling.php.
  • Figure 5: Sign language distribution of video examples in Spreadthesign, using the ISO 639-3 language codes.
  • ...and 2 more figures
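
The analogy in the Figure 2 caption can be made concrete with a small sketch of centroid arithmetic. Everything here is an assumption for illustration: the embeddings are synthetic stand-ins built from random "base", "royal", and "female" directions rather than outputs of the SignCLIP pose encoder, the helper `nearest_label` is hypothetical, and the arithmetic is done in the original embedding space, whereas the figure itself shows a t-SNE projection.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_examples = 32, 14  # 14 examples per sign, as in the caption

# Synthetic prototype directions (assumption, not real sign embeddings).
base, royal, female = rng.normal(size=(3, dim))
prototypes = {
    "man": base,
    "woman": base + female,
    "king": base + royal,
    "queen": base + royal + female,
}

# Sample noisy examples around each prototype and average them per sign.
centroids = {
    label: (proto + 0.1 * rng.normal(size=(n_examples, dim))).mean(axis=0)
    for label, proto in prototypes.items()
}

def nearest_label(query, centroids):
    """Return the label whose centroid has the highest cosine similarity to query."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda label: cosine(query, centroids[label]))

# King - Man + Woman, computed on cluster centers.
query = centroids["king"] - centroids["man"] + centroids["woman"]
print(nearest_label(query, centroids))
```

With real embeddings the relation would hold only approximately, so a check like this is a rough probe of latent-space structure, not a guarantee.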