Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

Jianyuan Guo, Peike Li, Trevor Cohn

TL;DR

This paper addresses SLT under the constraint of scarce gloss annotations by generating pseudo glosses from spoken-language text with LLMs and refining their temporal order through weakly supervised reordering. It presents a three-stage training pipeline (Sign2Gloss, Gloss2Text, and Sign2Text) that uses CTC and cross-entropy losses to bridge vision and language while removing the dependence on manual gloss labels. Empirical results on Phoenix14T and How2Sign show that the gloss-free PGG-SLT framework, with in-context LLM glossing and reordering, can outperform previous gloss-free methods and rival gloss-based approaches, especially with stronger translators such as Gemma2. The approach reduces annotation costs, scales to larger datasets, and offers a practical path toward robust SLT across languages and domains by aligning sign and spoken-language representations through pseudo gloss supervision.

Abstract

Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, limiting its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs using in-context learning to produce draft sign glosses from spoken language text. To enhance the correspondence between LLM-generated pseudo glosses and the sign sequences in video, we correct the ordering in the pseudo glosses for better alignment via a weakly supervised learning process. This reordering facilitates the incorporation of auxiliary alignment objectives and allows for the use of efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT model, which consists of a vision encoder and a translator, through a three-stage pipeline that progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves competitive results compared to gloss-based methods.
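
To make the role of the CTC loss concrete, the following is a minimal pure-Python sketch of the standard CTC forward algorithm, which sums the probability of every frame-level alignment that collapses to a given gloss sequence. It is an illustration under assumptions, not the authors' implementation: the function name `ctc_neg_log_likelihood`, the choice of index 0 for the blank symbol, and the list-of-lists input layout are all hypothetical. In practice a library routine such as `torch.nn.CTCLoss` would normally be used instead.

```python
import math

NEG_INF = float("-inf")


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    log_probs: T frames, each a list of V log-probabilities (assumed layout).
    target:    list of gloss ids without blanks (e.g. reordered pseudo glosses).
    blank:     index of the CTC blank symbol (assumed to be 0 here).
    """
    # Extended target with blanks interleaved: [b, y1, b, y2, ..., b]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(log_probs)

    # alpha[s] = log-probability of all partial alignments ending in state s
    alpha = [NEG_INF] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [NEG_INF] * num_states
        for s in range(num_states):
            incoming = [alpha[s]]                 # stay in the same state
            if s >= 1:
                incoming.append(alpha[s - 1])     # advance by one state
            # Skip over a blank only between two different non-blank labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                incoming.append(alpha[s - 2])
            total = _logsumexp(incoming)
            if total != NEG_INF:
                new_alpha[s] = total + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or the trailing blank
    final = _logsumexp(alpha[-2:]) if num_states > 1 else alpha[-1]
    return -final
```

Because CTC marginalises only over monotonic alignments, the target sequence must follow the order in which signs appear in the video, which is why the reordering step described above matters before this loss can be applied to pseudo glosses.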

Paper Structure

This paper contains 19 sections, 13 equations, 8 figures, 14 tables, 1 algorithm.

Figures (8)

  • Figure 1: The training pipeline comprises three stages. Pseudo glosses are generated by LLMs and reordered using weakly supervised learning paradigms (see the Method section).
  • Figure 2: BLEU4 vs. epochs on Phoenix14T test set.
  • Figure 3: Impact of different training stages.
  • Figure 4: Example prompt used for pseudo gloss generation with Gemini 1.5 Pro. The prompt contains a few example text-gloss pairs to guide the LLM in generating well-structured glosses for the query text. Detailed prompt formatting can be found in the supplementary prompt figures for Phoenix14T and How2Sign.
  • Figure 5: The prompt sent to the LLM for generating pseudo glosses includes two example pairs. These two examples serve as references to guide the LLM in producing accurate pseudo glosses during generation. The text marked in red represents spoken language from the datasets, which should be replaced during each iteration.
  • ...and 3 more figures
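
The few-shot prompting described in the captions for Figures 4 and 5 can be sketched as a small template function. This is only an illustrative sketch: the function name `build_gloss_prompt`, the instruction wording, the `Text:`/`Gloss:` field labels, and the example pairs are all assumptions made here, and the paper's actual prompt is the one shown in its figures. No LLM call is made; the function only assembles the string that would be sent.

```python
def build_gloss_prompt(examples, query_text, instruction=None):
    """Assemble an in-context prompt from (text, gloss) example pairs.

    examples:    list of (spoken_text, gloss_sequence) pairs used as references.
    query_text:  the spoken-language sentence to be glossed; this is the part
                 that is swapped out on each iteration over the dataset.
    instruction: optional task description (hypothetical wording by default).
    """
    if instruction is None:
        instruction = (
            "Convert the spoken-language sentence into a sequence of sign glosses."
        )
    lines = [instruction, ""]
    for text, gloss in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Gloss: {gloss}")
        lines.append("")
    # The query follows the same layout, with the gloss left for the LLM to fill
    lines.append(f"Text: {query_text}")
    lines.append("Gloss:")
    return "\n".join(lines)


# Made-up example pairs for illustration only (not taken from any dataset)
demo_examples = [
    ("Tomorrow it will rain in the north.", "TOMORROW NORTH RAIN"),
    ("The wind is weak at night.", "NIGHT WIND WEAK"),
]
demo_prompt = build_gloss_prompt(demo_examples, "It will be sunny on Friday.")
```

Keeping the query in the same layout as the examples and ending on an open `Gloss:` field is a common few-shot design choice, since it nudges the model to continue with a gloss sequence in the demonstrated style.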