Spatio-temporal transformer to support automatic sign language translation

Christian Ruiz, Fabio Martinez

TL;DR

The paper tackles automatic sign language translation with a spatio-temporal Transformer that preserves local and long-range spatial information in sign gestures through 2D CNN feature maps, 2D positional encodings, and a pixel-wise 2D self-attention module. The architecture operates on optical-flow representations within an encoder–decoder setup and uses gloss supervision via CTC. It is evaluated on CoL-SLTD and RWTH-PHOENIX-Weather-2014T (PHOENIX14T), where the authors report better translation quality than baseline approaches and robustness to real-world variations. The abstract reports a BLEU4 of 46.84% on CoL-SLTD and 30.77% on PHOENIX14T, while the results section reports a BLEU4 of 0.5137 on CoL-SLTD. The approach is computationally intensive, which motivates future work on efficiency, 3D convolutions, and expanded multi-head attention to improve scalability and generalization.
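To make the "2D positional encodings" mentioned above more concrete, here is a minimal NumPy sketch. It assumes the common sinusoidal construction in which half of the channels encode the row index and the other half encode the column index; the summary does not say which formulation the paper uses, so the function names (`positional_encoding_1d`, `positional_encoding_2d`) and the channel split are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def positional_encoding_1d(length, dim):
    """Standard sinusoidal encoding of shape (length, dim); dim must be even."""
    positions = np.arange(length)[:, None]                          # (length, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim / 2,)
    angles = positions * freqs[None, :]                             # (length, dim / 2)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)  # even channels: sine
    pe[:, 1::2] = np.cos(angles)  # odd channels: cosine
    return pe


def positional_encoding_2d(height, width, channels):
    """Assumed 2D variant: first half of the channels encodes the row,
    second half encodes the column. Returns shape (height, width, channels)."""
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    half = channels // 2
    pe_rows = positional_encoding_1d(height, half)   # (H, C / 2)
    pe_cols = positional_encoding_1d(width, half)    # (W, C / 2)
    pe = np.zeros((height, width, channels))
    pe[:, :, :half] = pe_rows[:, None, :]   # broadcast row code across columns
    pe[:, :, half:] = pe_cols[None, :, :]   # broadcast column code across rows
    return pe


# Hypothetical usage: add the encoding to a CNN feature map of the same shape.
feature_map = np.zeros((6, 8, 16))
encoded = feature_map + positional_encoding_2d(6, 8, 16)
```

With this layout every pixel of the feature map receives a code that depends on both of its coordinates, so the spatial arrangement is not lost when the map is later treated as a set of tokens.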

Abstract

Sign Language Translation (SLT) systems support communication for hearing-impaired people by finding equivalences between signed and spoken languages. This task is challenging, however, because of the many sign variations, the complexity of the language, and the inherent richness of expressions. Computational approaches have shown they can support SLT. Nonetheless, these approaches remain limited in covering gesture variability and in supporting long-sequence translation. This paper introduces a Transformer-based architecture that encodes spatio-temporal motion gestures, preserving both local and long-range spatial information through multiple convolutional and attention mechanisms. The proposed approach was validated on the Colombian Sign Language Translation Dataset (CoL-SLTD), outperforming baseline approaches and achieving a BLEU4 of 46.84%. It was also validated on RWTH-PHOENIX-Weather-2014T (PHOENIX14T), achieving a BLEU4 of 30.77%, which demonstrates its robustness and effectiveness in handling real-world variations.

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Pipeline of the proposed Transformer architecture. 1) Input representation. 2) Two-dimensional Self-Attention. 3) Feed-forward network.
  • Figure 2: Attention maps from the self-attention mechanism in a Transformer's decoder. The input sequence $x_1, x_2, \ldots, x_n$ is passed through a self-attention mechanism, which produces attention maps that weight each element in the input sequence according to its relevance to each other element.
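The captions above mention a two-dimensional self-attention step and attention maps that weight each element by its relevance to every other element. The sketch below illustrates that idea in NumPy with generic single-head scaled dot-product attention, treating each pixel of a feature map as one token. The projection matrices `w_q`, `w_k`, `w_v`, the shapes, and the function name `pixelwise_self_attention` are assumptions for illustration only; the paper's actual module may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pixelwise_self_attention(feature_map, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over all pixels.

    feature_map: (H, W, C) array, e.g. a CNN feature map.
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical parameters).
    Returns the attended map (H, W, D) and attention maps (H, W, H, W).
    """
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)        # one token per pixel
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (HW, HW) pairwise relevance
    attn = softmax(scores, axis=-1)               # each row sums to 1
    out = (attn @ v).reshape(h, w, -1)
    return out, attn.reshape(h, w, h, w)


# Hypothetical usage with random data.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 5, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
out, attn_maps = pixelwise_self_attention(fmap, w_q, w_k, w_v)
```

Here `attn_maps[i, j]` is an H × W map showing how strongly pixel (i, j) attends to every other pixel. The score matrix has (H·W)² entries, so the cost grows quickly with spatial resolution.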