Scaling up Multimodal Pre-training for Sign Language Understanding

Wengang Zhou, Weichao Zhao, Hezhen Hu, Zecheng Li, Houqiang Li

TL;DR

This work tackles sign language understanding (SLU) by scaling multimodal pre-training with a million-scale, text-labeled sign pose corpus (SL-1.5M) and a simple yet effective framework that jointly leverages sign pose (manual and non-manual) and textual information. It introduces a multi-task objective combining masked pose reconstruction and fine-grained sign-text contrastive learning, guided by a frozen multilingual text encoder and a pose decoder. The approach yields state-of-the-art results across 12 benchmarks spanning ISLR, CSLR, GF-SLT, and SL-RT, demonstrating strong cross-task generalization without task-specific designs. The dataset and method significantly advance sign-language pre-training, offering a scalable path to more robust and versatile SLU systems.

Abstract

Sign language serves as the primary means of communication for the deaf-mute community. Unlike spoken language, it commonly conveys information through the combination of manual features, i.e., hand gestures and body movements, and non-manual features, i.e., facial expressions and mouth cues. To facilitate communication between deaf-mute and hearing people, a series of sign language understanding (SLU) tasks have been studied in recent years, including isolated/continuous sign language recognition (ISLR/CSLR), gloss-free sign language translation (GF-SLT), and sign language retrieval (SL-RT). Sign language recognition and translation aim to understand the semantic meaning conveyed by sign languages at the gloss level and the sentence level, respectively. In contrast, SL-RT focuses on retrieving sign videos or corresponding texts from a closed set under the query-by-example search paradigm. These tasks investigate sign language from diverse perspectives and raise challenges in learning effective representations of sign language videos. To advance the development of sign language understanding, exploring a generalized model that is applicable across various SLU tasks is an important research direction.

Paper Structure (16 sections, 6 equations, 8 figures, 20 tables)

Figures (8)

  • Figure 1: Overview of (a) prior pre-training methods [hu2021signbert, hu2023signbert+, zhao2023best, albanie2020bsl] and (b) our proposed method. Due to limited pre-training data and insufficient information mining, existing approaches suffer from inferior performance and inconsistent generalization across diverse SLU tasks. In contrast, we collect adequate paired sign-text data and further design a novel pretext task to enhance the capability of our framework, achieving consistent improvement on diverse downstream tasks.
  • Figure 2: The key composition of the SL-1.5M dataset.
  • Figure 3: Statistics of the SL-1.5M dataset.
  • Figure 4: Distribution over sample durations.
  • Figure 5: Illustration of our proposed framework during pre-training. The input is paired sign pose and text data $(V_i, T_i)$. The sign pose encoder extracts different semantic features, i.e., manual and non-manual, from the masked pose sequence. The sign pose decoder reconstructs the masked joints from the incomplete pose data under the supervision of the pose reconstruction loss $\mathcal{L}_{PR}$. The corresponding text is fed into a text encoder to extract word-level features. Then, we align the latent spaces of the paired sign-text features through fine-grained similarity calculation. The sign-text contrastive loss $\mathcal{L}_{STC}$ jointly optimizes the pre-training procedure.
  • ...and 3 more figures
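
The two pre-training losses named in the Figure 5 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The tensor shapes, the mean-squared-error form of $\mathcal{L}_{PR}$, the max-over-words and mean-over-time aggregation used for the fine-grained similarity, the temperature of 0.07, and the equal weighting of the two losses are all assumptions made here for illustration, and every function name is hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def masked_pose_reconstruction_loss(pred, target, mask):
    """Squared error averaged over masked joints only (assumed form of L_PR).

    pred, target: (B, T, J, C) pose coordinates.
    mask: (B, T, J), 1.0 where a joint was masked out, 0.0 elsewhere.
    """
    sq_err = ((pred - target) ** 2).sum(axis=-1)  # (B, T, J)
    return (sq_err * mask).sum() / max(mask.sum(), 1.0)


def fine_grained_similarity(sign_feats, word_feats):
    """Pairwise sign-text similarity from step-level and word-level features.

    sign_feats: (B, T, D); word_feats: (B, W, D). Returns a (B, B) matrix.
    Assumed aggregation: each sign step takes its best-matching word,
    then the scores are averaged over time.
    """
    s = l2_normalize(sign_feats)
    w = l2_normalize(word_feats)
    # sim[i, j, t, k]: cosine between step t of sign i and word k of text j.
    sim = np.einsum("itd,jkd->ijtk", s, w)
    return sim.max(axis=-1).mean(axis=-1)


def log_softmax(x, axis):
    """Numerically stable log-softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def sign_text_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a (B, B) similarity matrix (assumed form of L_STC).

    Matched sign-text pairs are assumed to lie on the diagonal.
    """
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    sign_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_sign = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (sign_to_text + text_to_sign)


# Toy usage with random stand-ins for the encoder and decoder outputs.
rng = np.random.default_rng(0)
B, T, J, C, W, D = 4, 16, 10, 2, 6, 32
poses = rng.normal(size=(B, T, J, C))
mask = (rng.random((B, T, J)) < 0.4).astype(np.float64)
decoded = poses + 0.1 * rng.normal(size=poses.shape)  # stand-in decoder output
sign_feats = rng.normal(size=(B, T, D))
word_feats = rng.normal(size=(B, W, D))

loss_pr = masked_pose_reconstruction_loss(decoded, poses, mask)
loss_stc = sign_text_contrastive_loss(fine_grained_similarity(sign_feats, word_feats))
total_loss = loss_pr + loss_stc  # equal weighting is an assumption
```

Restricting the reconstruction error to masked joints keeps the decoder from being rewarded for copying visible inputs. Comparing features at the step and word level, rather than pooling each sequence into a single vector first, is one plausible reading of "fine-grained similarity calculation" in the caption.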