Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

Pooya Fayyazsanavi, Antonios Anastasopoulos, Jana Košecká

TL;DR

The paper addresses sign language Gloss2Text translation: converting gloss sequences into spoken German using fine-tuned large language models. It combines data augmentation (paraphrasing via an intermediate language and back-translation) with a Semantically Aware Label Smoothing (SALS) loss, which uses word embedding similarities so that predictions semantically close to the target are penalized less than unrelated ones. On the PHOENIX-2014T dataset, the approach achieves state-of-the-art results, with gains in BLEU, ROUGE, and chrF++, while using parameter-efficient adapters. Ablations confirm the contributions of both SALS and data augmentation. The work demonstrates the potential of LLM-based gloss-to-text translation for sign languages and highlights directions for improving robustness and cross-domain generalization.

Abstract

Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on the Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function that exploits gloss translation ambiguities, significantly improving on the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.

Paper Structure (13 sections, 2 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: An example of ambiguity in sign language: the gloss "BEWOELKT (CLOUDY)" is rendered by multiple translations within the dataset. As shown, the translations may share the gloss's meaning but differ in form, such as "wolken (cloudy)," or may express the broader concept the gloss stands for, such as "unbeständig (unstable)."
  • Figure 2: The proposed architecture for Gloss2Text translation. First, the similarity between each word and every other word is computed. During training with label smoothing, depicted on the left side, the model identifies the words most similar to the target word and assigns higher label values to them.
  • Figure 3: Comparison with chen2022two of word-level F-measure scores across different frequency buckets.
  • Figure 4: Comparison with chen2022two of sentence-level F-measure scores across different frequency buckets.
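
The label-smoothing idea described in the Figure 2 caption can be illustrated with a small sketch. It is not the paper's exact loss: the cosine similarity measure, the `top_k` cutoff, the proportional split of the smoothing mass `epsilon`, and the toy two-dimensional embeddings are all assumptions made for illustration, and the function names are hypothetical. Standard label smoothing spreads `epsilon` uniformly over the vocabulary, whereas here it goes only to the words most similar to the target.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_soft_targets(target, embeddings, epsilon=0.1, top_k=2):
    """Build a soft target distribution over the vocabulary.

    The target word keeps 1 - epsilon; the remaining epsilon is split
    among the top_k most similar words, in proportion to similarity.
    (Assumed scheme, not taken from the paper.)
    """
    sims = {
        word: cosine(embeddings[target], vec)
        for word, vec in embeddings.items()
        if word != target
    }
    neighbours = sorted(sims, key=sims.get, reverse=True)[:top_k]
    # Ignore neighbours with non-positive similarity.
    weights = {w: sims[w] for w in neighbours if sims[w] > 0.0}
    total = sum(weights.values())

    dist = {word: 0.0 for word in embeddings}
    if total == 0.0:
        dist[target] = 1.0  # no similar words: fall back to a hard label
        return dist
    dist[target] = 1.0 - epsilon
    for word, sim in weights.items():
        dist[word] = epsilon * sim / total
    return dist


def soft_cross_entropy(soft_targets, log_probs):
    """Cross-entropy between a soft target distribution and model log-probs."""
    return -sum(p * log_probs[w] for w, p in soft_targets.items() if p > 0.0)


# Toy embeddings (made up for illustration, not from any real model).
toy_embeddings = {
    "bewölkt": [1.0, 0.1],
    "wolken": [0.9, 0.2],
    "unbeständig": [0.6, 0.6],
    "sonnig": [-0.8, 0.5],
}
targets = semantic_soft_targets("bewölkt", toy_embeddings)

# A uniform model prediction over the four toy words.
uniform_log_probs = {w: math.log(0.25) for w in toy_embeddings}
loss = soft_cross_entropy(targets, uniform_log_probs)
```

With these toy vectors, "wolken" receives more of the smoothing mass than "unbeständig", and "sonnig" receives none, so a model that predicts a near-synonym of the target is penalized less than one that predicts an unrelated word.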