A multitask transformer to sign language translation using motion gesture primitives

Fredy Alejandro Mendoza López, Jefferson Rodriguez, Fabio Martínez

TL;DR

This work introduces a multitask transformer for sign language translation that simultaneously learns gloss representations and written-language translations from dense motion cues. A Spatiotemporal Feature Extractor converts video motion into a kinematic embedding, which the Motion-gloss Recognition Transformer Encoder maps to glosses via MHA in a kinematic–gloss domain and CTC supervision, while the Motion-gloss Translation Transformer Decoder translates to written text with dual MHA modules and autoregressive masking. On CoL-SLTD and RWTH-PHOENIX-Weather 2014 T, the approach achieves strong performance, notably a BLEU-4 of 72.64% on CoL-SLTD split 1 and a competitive BLEU-4 of 11.58% on RWTH, while ablations show substantial gains from optical-flow-based motion representations and gloss-driven supervision. The method yields a compact, deployable SL translation pipeline that leverages motion geometry and gloss alignment to improve translation reliability and generalization, highlighting the practical value of intermediate gloss representations in sign-language processing.
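
The autoregressive masking mentioned for the decoder can be illustrated with a minimal sketch. It assumes the standard causal (lower-triangular) mask used in transformer decoders. The function names `causal_mask` and `masked_softmax` and the toy scores are hypothetical and not taken from the paper.

```python
import math

def causal_mask(n):
    """n x n boolean matrix: entry [i][j] is True when position i may attend to j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to positions where the mask is False."""
    out = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        m = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy attention scores for a 3-token target sentence (illustrative values only).
scores = [[0.0, 1.0, 2.0],
          [0.5, 0.5, 0.5],
          [2.0, 1.0, 0.0]]
weights = masked_softmax(scores, causal_mask(3))
```

With this mask, the first target token can only attend to itself, so no row places weight on a later position. This is what lets the decoder be trained on full sentences while still predicting one word at a time.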

Abstract

The lack of effective communication with the deaf population is the main social gap affecting this community. Furthermore, sign language, the main communication tool of deaf people, is unlettered, i.e., it has no formal written representation. Consequently, the main challenge today is automatic translation between spatiotemporal sign representations and natural text language. Recent approaches are based on encoder-decoder architectures, where the most relevant strategies integrate attention modules to enhance non-linear correspondences. Moreover, many of these approaches require complex training and architectural schemes to achieve reasonable predictions because they lack intermediate text projections. They are also still limited by the redundant background information in the video sequences. This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation. The proposed approach also includes a dense motion representation that enhances gestures and includes kinematic information, a key component in sign language. This representation makes it possible to avoid background information and exploit the geometry of the signs. In addition, it includes spatiotemporal representations that facilitate the alignment between gestures and glosses as an intermediate textual representation. The proposed approach outperforms the state of the art on the CoL-SLTD dataset, achieving a BLEU-4 of 72.64% in split 1 and a BLEU-4 of 14.64% in split 2. Additionally, the strategy was validated on the RWTH-PHOENIX-Weather 2014 T dataset, achieving a competitive BLEU-4 of 11.58%.

Paper Structure

This paper contains 20 sections, 2 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Proposed method. The optical flow is given as input to a volumetric strategy (STFE) that extracts at a low level the kinematic information from the video. This information is processed through the encoder (MGRTE), which, through nonlinear relationships between the spatiotemporal information, models and recognizes the glosses. On the other hand, the decoder (MGTTD) generates the translation based on temporal projections between the videos and the written language. STFE: Spatiotemporal Feature Extractor, MGRTE: Motion-gloss recognition transformer encoder, MGTTD: Motion-gloss translation transformer decoder.
  • Figure 2: Example of Brox optical flow representation. A set of frames from an SL video represented in Brox optical flow. Each frame in this representation contains visual features that highlight motion.
  • Figure 3: Spatiotemporal feature extractor (STFE). This proposed module extracts low-level and long-term kinematic features using a volumetric strategy. Each sequence results in a latent space with dimensions $L \times d$ containing the most relevant information from the video.
  • Figure 4: STFE convolutional block. Each block is composed of four layers that fulfill different functions. In the proposed methodology, six consecutive blocks were used.
  • Figure 5: Motion-gloss recognition transformer. This module produces a set $\mathbf{\hat{K}}$ of encoded embeddings and a set $\mathbf{G}$ of glosses. Taking the embedded space $\mathbf{K}$ as input, a positional encoding layer adds positional information to each vector $K_l$. This representation then passes through a multi-head attention module, which finds nonlinear relationships between the elements of $\mathbf{K}$. Finally, the result is fed to a point-wise feed-forward layer followed by normalization modules. This sequence is repeated $B$ times, yielding a matrix $\mathbf{\hat{K}}$ with encoded kinematic information. MHA: multi-head attention, AN: addition and normalization, FF: point-wise feed-forward layer, LN: linear layer, $\sigma$: softmax activation, GR: gloss recognition module.
  • ...and 8 more figures
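
The encoder data flow described in the Figure 5 caption (positional encoding, attention, addition and normalization, point-wise feed-forward, repeated $B$ times) can be sketched in plain Python. This is only an illustration under loud assumptions: it uses a single attention head instead of multi-head attention, has no learned weights, replaces the feed-forward layer with a bare ReLU, assumes the sinusoidal positional encoding of the original transformer, and omits the gloss recognition head (linear layer plus softmax). All function names are hypothetical.

```python
import math

def positional_encoding(L, d):
    """Sinusoidal encoding (assumed; the caption does not state which scheme is used)."""
    pe = [[0.0] * d for _ in range(L)]
    for pos in range(L):
        for i in range(d):
            angle = pos / (10000 ** (2 * (i // 2) / d))
            pe[pos][i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return pe

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head scaled dot-product self-attention without learned projections."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

def add_and_norm(x, y, eps=1e-6):
    """Residual addition followed by layer normalization (no gain or bias)."""
    out = []
    for xr, yr in zip(x, y):
        s = [a + b for a, b in zip(xr, yr)]
        mean = sum(s) / len(s)
        var = sum((v - mean) ** 2 for v in s) / len(s)
        out.append([(v - mean) / math.sqrt(var + eps) for v in s])
    return out

def feed_forward(x):
    """Placeholder for the point-wise feed-forward layer: ReLU only, no weights."""
    return [[max(0.0, v) for v in row] for row in x]

def encoder(K, B=2):
    """Map an L x d kinematic embedding K to an L x d encoded matrix K_hat."""
    pe = positional_encoding(len(K), len(K[0]))
    h = [[k + p for k, p in zip(kr, pr)] for kr, pr in zip(K, pe)]
    for _ in range(B):  # the block sequence is repeated B times
        h = add_and_norm(h, self_attention(h))
        h = add_and_norm(h, feed_forward(h))
    return h

# Toy embedding with L = 3 time steps and d = 4 features (illustrative values only).
K = [[0.1, 0.2, 0.3, 0.4],
     [0.4, 0.3, 0.2, 0.1],
     [0.0, 1.0, 0.0, 1.0]]
K_hat = encoder(K, B=2)
```

The output keeps the $L \times d$ shape of the input, which is what allows the blocks to be stacked. In a trained model, each step would carry learned projection matrices and several attention heads.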