
Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation

Ryan Wong, Necati Cihan Camgoz, Richard Bowden

TL;DR

This work tackles gloss-free sign language translation (SLT) by leveraging large pretrained vision and language models through lightweight adapters. Sign2GPT keeps the pretrained weights of a DinoV2 spatial backbone and an XGLM decoder frozen, and trains a dedicated sign encoder via a novel pseudo-gloss pretraining strategy that aligns visual sign representations with automatically extracted pseudo-glosses. The approach yields state-of-the-art gloss-free results on Phoenix14T and CSL-Daily, narrowing the gap to gloss-based SLT. The lightweight adapters keep training tractable despite limited dataset sizes and long sign videos, which suggests that visual priors can be combined with frozen language models for scalable, data-efficient SLT.

Abstract

Automatic Sign Language Translation requires the integration of both computer vision and natural language processing to effectively bridge the communication gap between sign and spoken languages. However, the lack of large-scale training data for sign language translation means we need to leverage resources from spoken language. We introduce Sign2GPT, a novel framework for gloss-free sign language translation that uses large-scale pretrained vision and language models via lightweight adapters. The lightweight adapters are crucial for sign language translation because of the constraints imposed by limited dataset sizes and the computational requirements of training with long sign videos. We also propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses without requiring gloss order information or annotations. We evaluate our approach on two public benchmark sign language translation datasets, namely RWTH-PHOENIX-Weather 2014T and CSL-Daily, and improve on state-of-the-art gloss-free translation performance by a significant margin.
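
To make the idea of a lightweight adapter on a frozen model concrete, the following is a minimal sketch of a low-rank update added to a frozen linear layer. It is an illustration under stated assumptions, not the authors' implementation: the class name `LowRankAdaptedLinear`, the `rank` and `scale` parameters, the initialization, and the toy 8x8 weight are all hypothetical.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class LowRankAdaptedLinear:
    """Frozen weight W plus a trainable low-rank update B @ A.

    Hypothetical illustration: names and initialization are assumptions.
    """

    def __init__(self, W, rank, scale=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # pretrained weights, never updated
        # A starts small and random, B starts at zero, so the adapted
        # layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = scale

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_parameter_count(self):
        return len(self.W) * len(self.W[0])


# Toy stand-in for a pretrained weight: an 8x8 identity matrix.
W = [[float(i == j) for j in range(8)] for i in range(8)]
layer = LowRankAdaptedLinear(W, rank=2)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = layer.forward(x)
```

With rank 2, the sketch trains 32 values against 64 frozen ones. The gap grows quickly with layer width, which is the usual argument for adapters when data and compute are limited.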

Paper Structure

This paper contains 33 sections, 4 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Overview of Sign2GPT, which consists of a pretraining stage that makes use of pseudo-glosses and downstream translation that leverages a frozen GPT model.
  • Figure 2: Overview of adapting layers in the spatial model layers (left) and decoder layer (right). We make use of adapters that introduce new low-rank weights to blocks shown by the dashed lines while keeping the original pretrained weights frozen.
  • Figure 3: Overview of pretraining process, which takes the sign features as input and predicts the existence of pseudo-glosses.
  • Figure 4: Visualizations of the localization capabilities of our pretraining stage. We visualize only the pseudo-glosses from the target sentence (y-axis) over time (x-axis), with whiter regions indicating a higher probability of the pseudo-gloss occurring during the time segment. We also display the localized gloss (under the video frames) based on a threshold of 0.2 on $E$.
  • Figure 5: Visualizations of the localization capabilities of our pretraining stage on the pseudo-glosses from the Phoenix14T dataset. We visualize only the pseudo-glosses from the target sentence (y-axis) over time (x-axis), with whiter regions indicating a higher probability of the pseudo-gloss occurring during the time segment. We also display the localized gloss (under the video frames) based on a threshold of 0.2 on $E$.
  • ...and 1 more figure
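
The thresholding described in the captions of Figures 4 and 5 can be sketched as follows. This is a hedged illustration only: it assumes $E$ is laid out as one row per pseudo-gloss and one column per time segment, and it assigns each segment the highest-scoring pseudo-gloss if that score exceeds 0.2. The argmax rule, the function name `localize`, and the example glosses and scores are assumptions made for illustration, not values from the paper.

```python
def localize(E, glosses, threshold=0.2):
    """Return one label per time segment.

    E[g][t] is the score of pseudo-gloss g at time segment t (assumed layout).
    A segment gets the best-scoring gloss if its score exceeds the
    threshold, otherwise None.
    """
    n_segments = len(E[0])
    labels = []
    for t in range(n_segments):
        best = max(range(len(glosses)), key=lambda g: E[g][t])
        labels.append(glosses[best] if E[best][t] > threshold else None)
    return labels


# Made-up scores for three hypothetical pseudo-glosses over five segments.
glosses = ["REGEN", "MORGEN", "NORD"]
E = [
    [0.05, 0.60, 0.70, 0.10, 0.02],
    [0.10, 0.15, 0.05, 0.55, 0.01],
    [0.01, 0.02, 0.30, 0.19, 0.03],
]
labels = localize(E, glosses)
# labels == [None, "REGEN", "REGEN", "MORGEN", None]
```

Segments where no pseudo-gloss clears the threshold stay unlabeled, which corresponds to the frames without a localized gloss in the visualizations.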