SignMouth: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion

Wenfang Wu, Tingting Yuan, Yupeng Li, Daling Wang, Xiaoming Fu

TL;DR

This work addresses sign language translation (SLT) by incorporating mouthing cues, a previously underutilized non-manual signal, to resolve ambiguities in gloss-free translation. SignMouth uses a dual-stream encoder (gesture and mouthing) with gated fusion, followed by temporal modeling and a Flan-T5-based decoder fine-tuned with LoRA. It introduces hierarchical contrastive objectives, including $\mathcal{L}_{vt}$ for visual-text alignment and $\mathcal{L}_{sm}$ for gesture-mouthing alignment, to strengthen cross-modal representations. Evaluations on PHOENIX14T and How2Sign show state-of-the-art performance in gloss-free SLT, with notable BLEU-4 gains (e.g., +0.39 on PHOENIX14T and +0.64 on How2Sign) and improved sentence fluency and disambiguation. Overall, SignMouth demonstrates the practical impact of non-manual cues for more accurate and fluent sign-to-text translation, especially in open-domain contexts.

Abstract

Sign language translation (SLT) aims to translate sign language videos into natural language text, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches focus mainly on manual signals (hand gestures) and tend to overlook non-manual cues such as mouthing. Yet mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignMouth, a novel framework that improves the accuracy of sign language translation by fusing manual and non-manual cues, specifically spatial gesture and lip movement features. In addition, SignMouth introduces a hierarchical contrastive learning framework with multi-level alignment objectives that encourage semantic consistency across the sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T in the gloss-free setting, SignMouth surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38.
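
The two alignment levels named in the abstract (sign-lip and visual-text) can be illustrated with a small contrastive-loss sketch. The page does not give the exact form of the objectives, so everything below is an assumption made for illustration: a symmetric InfoNCE loss, a temperature of 0.07, random stand-in embeddings, an averaged "fused" feature, and a hypothetical weight of 0.5 on the sign-lip term.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for two batches of paired embeddings, shape (B, D).

    Row i of `a` and row i of `b` form a positive pair; every other row in
    the batch acts as a negative. The temperature value is an assumption.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (B, B) scaled cosine similarities

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row, then pick the positives.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
gesture = rng.normal(size=(8, 32))                     # stand-in gesture features
mouthing = gesture + 0.05 * rng.normal(size=(8, 32))   # nearly aligned mouthing features
text = rng.normal(size=(8, 32))                        # unrelated stand-in text features
fused = 0.5 * (gesture + mouthing)                     # placeholder for the fused visual feature

loss_sm = info_nce(gesture, mouthing)   # sign-lip (gesture-mouthing) level
loss_vt = info_nce(fused, text)         # visual-text level
total = loss_vt + 0.5 * loss_sm         # 0.5 is a hypothetical weight
```

In this toy setup the gesture and mouthing embeddings are almost identical, so `loss_sm` is small, while the random text embeddings leave `loss_vt` high, which is the gap that training on such an objective would reduce.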

Paper Structure

This paper contains 31 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Examples of visually similar signs in SLT.
  • Figure 2: The SignMouth framework consists of four main components: (1) a spatial encoder (SE) based on a Vision Transformer (ViT) that captures spatial representations of hand gestures; (2) a lip-reading encoder (LE) that models mouth shape dynamics to complement gesture semantics; (3) a multimodal contrastive fusion module that uses a hierarchical contrastive learning strategy to align and integrate gesture, mouthing, and textual features; and (4) an LLM, fine-tuned with Low-Rank Adaptation (LoRA), that takes the fused visual features along with language-instructive prompts and performs sign-to-text translation. C denotes concatenation, G the gate mechanism, and P mean pooling.
  • Figure 3: Example of a detected face region.
  • Figure 4: t-SNE visualization of the fused features.
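
The concatenate (C), gate (G), and mean-pooling (P) operations named in the Figure 2 caption can be sketched as follows. The caption does not specify the gate's exact form, so this is only one plausible reading: a sigmoid gate computed from the concatenated streams that mixes gesture and mouthing features per dimension. The shapes, the weight matrix `W`, the bias `b`, and the function name `gated_fusion` are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(gesture, mouthing, W, b):
    """Fuse two (T, D) feature streams with a learned gate (assumed form).

    W has shape (2D, D) and b has shape (D,); both are hypothetical parameters.
    """
    concat = np.concatenate([gesture, mouthing], axis=-1)  # C: (T, 2D)
    gate = sigmoid(concat @ W + b)                         # G: (T, D), values in (0, 1)
    # Per-dimension convex combination of the two streams.
    return gate * gesture + (1.0 - gate) * mouthing

rng = np.random.default_rng(0)
T, D = 16, 32                                  # frames and feature size (illustrative)
gesture = rng.normal(size=(T, D))              # stand-in spatial encoder output
mouthing = rng.normal(size=(T, D))             # stand-in lip-reading encoder output
W = rng.normal(scale=0.1, size=(2 * D, D))
b = np.zeros(D)

fused = gated_fusion(gesture, mouthing, W, b)  # (T, D) fused sequence
pooled = fused.mean(axis=0)                    # P: (D,) mean-pooled summary
```

Because the gate lies strictly between 0 and 1, each fused value falls between the corresponding gesture and mouthing values, so neither stream can be amplified beyond its input.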