Semi-Supervised Spoken Language Glossification

Huijie Yao; Wengang Zhou; Hao Zhou; Houqiang Li

TL;DR

This work addresses the bottleneck of limited parallel data in spoken language glossification (SLG) by proposing $S^3$LG, a semi-supervised framework that leverages large-scale monolingual spoken language text. It combines rule-based and model-based auto-annotation to generate complementary pseudo glosses, trains in two stages with consistency regularization, and uses a tagging mechanism to distinguish data sources across $K$ iterative cycles. Empirical results on PHOENIX14T and CSL-Daily show BLEU-4 and CHRF gains over strong baselines, and ablations confirm the contribution of each component, including data mixing, regularization, and iterative refinement. The approach demonstrates that unlabeled data can boost SLG and offers a scalable path toward improved sign-language transcription, while acknowledging limitations in extreme low-resource scenarios and the need for broader sign-language coverage.

Abstract

Spoken language glossification (SLG) aims to translate spoken language text into sign language gloss, i.e., a written record of sign language. In this work, we present a framework named $S$emi-$S$upervised $S$poken $L$anguage $G$lossification ($S^3$LG) for SLG. To tackle the bottleneck of limited parallel data in SLG, our $S^3$LG incorporates large-scale monolingual spoken language text into SLG training. The proposed framework follows the self-training structure, which iteratively annotates and learns from pseudo labels. Considering the lexical similarity and syntactic difference between sign language and spoken language, our $S^3$LG adopts both a rule-based heuristic and a model-based approach for auto-annotation. During training, we randomly mix these complementary synthetic datasets and mark their differences with a special token. As the synthetic data may be of lower quality, $S^3$LG further leverages consistency regularization to reduce the negative impact of noise in the synthetic data. Extensive experiments are conducted on public benchmarks to demonstrate the effectiveness of $S^3$LG. Our code is available at \url{https://github.com/yaohj11/S3LG}.
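The abstract does not give the exact form of the consistency regularizer. As a rough illustration only, one common form of consistency regularization penalizes disagreement between two predictive distributions obtained for the same input (for example, from two stochastic forward passes). The sketch below computes a symmetric KL divergence for a single token position; the function names and the choice of symmetric KL are assumptions for illustration, not the authors' stated objective.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_a, logits_b):
    """Symmetric KL between two predictions for the same token (assumed form)."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))
```

In such a setup the term would typically be added, with a weighting coefficient, to the usual cross-entropy loss, so that noisy pseudo glosses pull the model less strongly toward inconsistent predictions.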

Paper Structure (19 sections, 7 equations, 2 figures, 13 tables)

Introduction
Related Work
Methodology
Overview
Annotating Monolingual Data
Two-Stage Training
Pseudo-Glosses Sampling
Training Objective
Stage-One and Stage-Two
Experiments
Experimental Setup
Comparison with State-of-the-Art Methods
Ablation Study
Conclusion
Rules Used in the Rule-based Heuristic for Creating Synthetic Data $\mathcal{D}_{U,r}$
...and 4 more sections

Figures (2)

Figure 1: Overview of the proposed $S^3$LG. It iteratively annotates and learns from pseudo labels to obtain the final SLG model $f(\hat{\theta}^K)$ after a total of $K$ iterations. For the $k$-th iteration, the synthetic data $\mathcal{D}_U^{k-1}$ is constructed by randomly mixing the two complementary pseudo glosses generated by the fixed rules and the previously obtained SLG model $f(\hat{\theta}^{k-1})$, respectively. Note that at the first iteration, only the rule-based synthetic data is available.
Figure 2: Illustration of the training process in the $k$-th iteration.
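
The iterative procedure described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag strings `<rule>` and `<model>`, the 50/50 mixing probability, the placement of the tag on the source text, the toy stop-word rule, and the `train_fn` callback are all hypothetical. The sketch only shows the control flow: rule-based pseudo glosses alone in the first iteration, then a per-sentence random mix of rule-based and model-generated pseudo glosses in later iterations.

```python
import random

RULE_TAG = "<rule>"    # hypothetical tag; the source only mentions "a special token"
MODEL_TAG = "<model>"  # hypothetical tag

def rule_based_gloss(text):
    """Toy heuristic (assumption): drop a few function words and uppercase the rest."""
    stop = {"the", "a", "an", "is", "are", "of", "to"}
    return " ".join(w.upper() for w in text.lower().split() if w not in stop)

def mix_pseudo_glosses(texts, model_annotate, rng):
    """Pick one annotation source per sentence and tag the source text accordingly."""
    data = []
    for text in texts:
        # Before any model exists (first iteration), only the rules are available.
        if model_annotate is not None and rng.random() < 0.5:
            data.append((f"{MODEL_TAG} {text}", model_annotate(text)))
        else:
            data.append((f"{RULE_TAG} {text}", rule_based_gloss(text)))
    return data

def self_train(parallel, monolingual, train_fn, num_iterations, seed=0):
    """Run K rounds of annotate-then-train; train_fn returns a text -> gloss callable."""
    rng = random.Random(seed)
    model_annotate = None
    for _ in range(num_iterations):
        synthetic = mix_pseudo_glosses(monolingual, model_annotate, rng)
        model_annotate = train_fn(parallel, synthetic)
    return model_annotate
```

Here `train_fn(parallel, synthetic)` stands in for the whole two-stage training step of one iteration; any function that consumes the gold parallel pairs plus the tagged synthetic pairs and returns an annotator would fit this skeleton.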
