Linguistics-Vision Monotonic Consistent Network for Sign Language Production
Xu Wang, Shengeng Tang, Peipei Song, Shuo Wang, Dan Guo, Richang Hong
TL;DR
This work tackles the cross-modal semantic gap in Sign Language Production by introducing LVMCN, a Transformer-based framework with two modules: a Cross-modal Semantic Aligner (CSA) for fine-grained monotonic gloss-to-pose alignment and a Multimodal Semantic Comparator (MSC) for coarse-grained semantic consistency. The model jointly optimizes $\mathcal{L}_{acc}$, $\mathcal{L}_{ali}$, and $\mathcal{L}_{com}$, where $\mathcal{L}_{ali}$ constrains alignment through a cosine-similarity association matrix between gloss and pose feature sequences, and $\mathcal{L}_{com}$ uses multimodal triplets built from paired and unpaired samples in a batch to tighten the semantic coupling between text and video. Experiments on PHOENIX14T show state-of-the-art performance across BLEU, ROUGE, WER, DTW-P, FID, and MPJPE, indicating improved sign-video realism and linguistic-visual consistency without word-action correspondence labels. The approach offers a pathway toward more accurate and natural SLP and may inform broader cross-modal alignment research.
Abstract
Sign Language Production (SLP) aims to generate sign videos corresponding to spoken language sentences, where the conversion of sign Glosses to Poses (G2P) is the key step. Due to the cross-modal semantic gap and the lack of word-action correspondence labels for strongly supervised alignment, SLP faces major challenges in linguistics-vision consistency. In this work, we propose a Transformer-based Linguistics-Vision Monotonic Consistent Network (LVMCN) for SLP, which constrains fine-grained cross-modal monotonic alignment and coarse-grained multimodal semantic consistency in language-visual cues through a Cross-modal Semantic Aligner (CSA) and a Multimodal Semantic Comparator (MSC). In the CSA, we constrain the implicit alignment between corresponding gloss and pose sequences by computing the cosine similarity association matrix between cross-modal feature sequences (i.e., the order consistency of fine-grained sign glosses and actions). In the MSC, we construct multimodal triplets from paired and unpaired samples in batch data. By pulling corresponding text-visual pairs closer and pushing non-corresponding text-visual pairs apart, we constrain the degree of semantic co-occurrence between corresponding gloss and pose sequences (i.e., the semantic consistency of coarse-grained textual sentences and sign videos). Extensive experiments on the popular PHOENIX14T benchmark show that LVMCN outperforms state-of-the-art methods.
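The two constraints described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the margin value, the hinge form of the triplet term, and the averaging over all in-batch negatives are assumptions made for illustration, since the abstract does not give the exact loss formulations. The first function computes a cosine similarity association matrix between a gloss feature sequence and a pose feature sequence; the second treats matching text-video pairs in a batch as positives and all other pairings as negatives.

```python
import numpy as np


def cosine_similarity_matrix(gloss_feats, pose_feats, eps=1e-8):
    """Cosine similarity between every gloss feature and every pose feature.

    gloss_feats: (N, D) array, pose_feats: (T, D) array -> (N, T) matrix.
    """
    g = gloss_feats / (np.linalg.norm(gloss_feats, axis=1, keepdims=True) + eps)
    p = pose_feats / (np.linalg.norm(pose_feats, axis=1, keepdims=True) + eps)
    return g @ p.T


def batch_triplet_loss(text_emb, video_emb, margin=0.2):
    """Hinge-style triplet loss over a batch (assumed form, batch size > 1).

    Row i of text_emb is paired with row i of video_emb (positive);
    every other video in the batch serves as a negative for text i.
    """
    sim = cosine_similarity_matrix(text_emb, video_emb)  # (B, B)
    pos = np.diag(sim)[:, None]                          # similarity of true pairs
    violations = np.maximum(0.0, margin + sim - pos)     # negatives too close
    np.fill_diagonal(violations, 0.0)                    # ignore the positives
    batch = sim.shape[0]
    return violations.sum() / (batch * (batch - 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gloss = rng.normal(size=(5, 16))   # 5 gloss tokens, 16-dim features
    pose = rng.normal(size=(40, 16))   # 40 pose frames, 16-dim features
    print(cosine_similarity_matrix(gloss, pose).shape)

    text = rng.normal(size=(4, 16))    # 4 sentence-level embeddings
    video = rng.normal(size=(4, 16))   # 4 video-level embeddings
    print(batch_triplet_loss(text, video))
```

In this sketch the association matrix only measures similarity; enforcing the monotonic (order-preserving) structure on it is the role of the CSA and would require the loss defined in the paper itself.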
