A Transformer-Based Framework for Greek Sign Language Production using Extended Skeletal Motion Representations

Chrysa Pratikaki, Panagiotis Filntisis, Athanasios Katsamanis, Anastasios Roussos, Petros Maragos

TL;DR

The paper addresses Greek Sign Language Production (SLP) by building a Transformer-based pipeline that maps text to sign-language pose sequences using an extended skeleton representation derived from MediaPipe Holistic. Key innovations include a hybrid training schedule that alternates between teacher forcing and autoregressive decoding, and a pose-to-text translation loss $L_T$ to reinforce forward translation, with additional exploration of gloss extraction as an intermediate step. Experiments on Elementary23 demonstrate improved sequence quality and alignment, while signer-specific and signer-independent analyses reveal strengths and limitations in generalization. Overall, the work advances Greek SLP by leveraging extended skeletal representations and gloss-based strategies, with potential educational applications and future work toward photorealistic SL video synthesis that preserves sign language integrity and user ethics.

Abstract

Sign languages are the primary form of communication for Deaf communities across the world. To break the communication barriers between the Deaf and Hard-of-Hearing and the hearing communities, it is imperative to build systems capable of translating spoken language into sign language and vice versa. Building on insights from previous research, we propose a deep learning model for Sign Language Production (SLP), which to our knowledge is the first attempt at Greek SLP. We tackle this task with a transformer-based architecture that enables translation from text input to human pose keypoints, and the reverse. We evaluate the effectiveness of the proposed pipeline on the Greek SL dataset Elementary23 through a series of comparative analyses and ablation studies. The pipeline's components, which include data-driven gloss generation, training through video-to-text translation, and a scheduling algorithm that combines teacher forcing with auto-regressive decoding, appear to enhance the quality of the produced SL videos.
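
To illustrate the difference between teacher forcing and auto-regressive decoding that the scheduling component switches between, here is a minimal sketch. It is not the authors' algorithm: the epoch-based cycle, the names `decoding_mode` and `decode`, and the parameters `tf_epochs` and `ar_epochs` are all assumptions made for illustration, and the toy `step_fn` stands in for a real pose decoder.

```python
def decoding_mode(epoch: int, tf_epochs: int = 2, ar_epochs: int = 1) -> str:
    """Pick a decoding mode for an epoch using a hypothetical repeating cycle.

    Assumption: tf_epochs epochs of teacher forcing are followed by
    ar_epochs epochs of autoregressive decoding, then the cycle repeats.
    """
    cycle = tf_epochs + ar_epochs
    return "teacher_forcing" if epoch % cycle < tf_epochs else "autoregressive"


def decode(step_fn, first_frame, target_frames, mode):
    """Roll out one sequence; step_fn(prev_frame) returns the next predicted frame."""
    outputs = []
    prev = first_frame
    for target in target_frames:
        pred = step_fn(prev)
        outputs.append(pred)
        # Teacher forcing feeds the ground-truth frame back in;
        # autoregressive decoding feeds back the model's own prediction.
        prev = target if mode == "teacher_forcing" else pred
    return outputs


# Toy usage: a "model" that adds 1 to the previous (scalar) frame.
toy_step = lambda x: x + 1
tf_out = decode(toy_step, 0, [10, 20, 30], "teacher_forcing")
ar_out = decode(toy_step, 0, [10, 20, 30], "autoregressive")
```

With teacher forcing, each prediction is conditioned on the ground-truth previous frame, so errors do not accumulate during training. With autoregressive decoding, the model sees its own outputs, which matches inference-time conditions.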

Paper Structure

This paper contains 16 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of the proposed architecture: Given a text sentence as input, our SLP pipeline generates the corresponding sign language sequence. During training, the Encoder-Decoder structure learns through a sum of MSE Regression Loss (between frames) and CTC (pose-to-text) Loss. Optionally, training can happen using data-driven generated glosses to limit lexical diversity.
  • Figure 2: Extended Skeleton Representation based on MediaPipe Holistic [mediapipe_holistic]: (a) Original 33 MP pose landmarks. (b) Selected 8 MP pose landmarks for SLP. (c) Original 478 MP face landmarks. (d) Selected 141 MP face landmarks for SLP. (e) MP hands.
  • Figure 3: Proposed Sign Language Production Transformer
  • Figure 4: Proposed Sign Language Translation Transformer
  • Figure 5: Comparison of the averaged DTW results on the Math and Greek Test Subsets. Again, the hybrid combination of teacher forcing and auto-regressive decoding during training significantly improves sequence alignment.
  • ...and 2 more figures
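
The Figure 5 caption reports averaged DTW (dynamic time warping) results as a measure of sequence alignment. As a rough sketch of what such a score computes, the function below implements the textbook DTW recurrence with Euclidean distance between frames. The function name `dtw_distance` and the absence of any normalization are assumptions; the paper's exact variant may differ.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of equal-dimension frames.

    Each frame is a sequence of floats (e.g. flattened keypoint coordinates).
    Assumption: unnormalized cost with Euclidean frame distance.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b frame repeated
                cost[i][j - 1],      # b advances, a frame repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy usage: the second sequence holds the first pose for one extra frame.
reference = [(0.0, 0.0), (1.0, 1.0)]
produced = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
score = dtw_distance(reference, produced)
```

Because DTW allows frames to be repeated during alignment, a produced sequence that differs from the reference only in timing receives a low cost, which makes it suited to comparing pose sequences of different lengths.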