Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation
Jianyuan Guo, Peike Li, Trevor Cohn
TL;DR
This paper addresses Sign Language Translation (SLT) when gloss annotations are scarce: an LLM generates pseudo glosses from spoken-language text, and weakly supervised reordering refines their temporal order. The model is trained with a three-stage pipeline (Sign2Gloss, Gloss2Text, and Sign2Text) that uses CTC and cross-entropy losses to bridge vision and language without manual gloss labels. On Phoenix14T and How2Sign, the gloss-free PGG-SLT with in-context LLM glossing and reordering outperforms previous gloss-free methods and rivals gloss-based approaches, especially with stronger translators such as Gemma2. Because pseudo gloss supervision aligns sign and spoken-language representations without manual labels, the approach reduces annotation costs, scales to larger datasets, and offers a practical path toward robust SLT across languages and domains.
Abstract
Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, limiting its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs using in-context learning to produce draft sign glosses from spoken language text. To enhance the correspondence between LLM-generated pseudo glosses and the sign sequences in video, we correct the ordering of the pseudo glosses via a weakly supervised learning process. This reordering facilitates the incorporation of auxiliary alignment objectives and allows for efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT model, which consists of a vision encoder and a translator, through a three-stage pipeline that progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves competitive results compared to gloss-based methods.
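The in-context prompting step can be pictured with a minimal sketch. It only assembles a few-shot prompt from text-gloss pairs; the function name `build_gloss_prompt`, the instruction wording, and the example pairs are all hypothetical illustrations, not the prompt or data used in the paper, and the call to an actual LLM is left out.

```python
from typing import List, Tuple


def build_gloss_prompt(examples: List[Tuple[str, str]], sentence: str) -> str:
    """Assemble a few-shot prompt asking an LLM to gloss `sentence`.

    `examples` holds (spoken-language text, gloss sequence) pairs that
    serve as in-context demonstrations.
    """
    # Hypothetical instruction wording; the paper's actual prompt may differ.
    lines = ["Convert each spoken-language sentence into a sequence of sign glosses."]
    for text, gloss in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Gloss: {gloss}")
    # Leave the final gloss open so the LLM completes it.
    lines.append(f"Text: {sentence}")
    lines.append("Gloss:")
    return "\n".join(lines)


# Invented demonstration pairs, not taken from Phoenix14T or How2Sign.
EXAMPLES = [
    ("Tomorrow it will rain in the north.", "TOMORROW NORTH RAIN"),
    ("The night will be cold.", "NIGHT COLD"),
]

prompt = build_gloss_prompt(EXAMPLES, "It will be sunny on Friday.")
print(prompt)
```

The completion returned for the final `Gloss:` line would be a draft pseudo gloss sequence. Its order follows the spoken sentence rather than the signing, which is why the reordering step described above is needed before CTC supervision.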
