A Transformer-Based Framework for Greek Sign Language Production using Extended Skeletal Motion Representations
Chrysa Pratikaki, Panagiotis Filntisis, Athanasios Katsamanis, Anastasios Roussos, Petros Maragos
TL;DR
The paper addresses Greek Sign Language Production (SLP) by building a Transformer-based pipeline that maps text to sign-language pose sequences using an extended skeleton representation derived from MediaPipe Holistic. Key innovations include a hybrid training schedule that alternates between teacher forcing and autoregressive decoding, and a pose-to-text translation loss $L_T$ to reinforce forward translation, with additional exploration of gloss extraction as an intermediate step. Experiments on Elementary23 demonstrate improved sequence quality and alignment, while signer-specific and signer-independent analyses reveal strengths and limitations in generalization. Overall, the work advances Greek SLP by leveraging extended skeletal representations and gloss-based strategies, with potential educational applications and future work toward photorealistic SL video synthesis that preserves sign language integrity and user ethics.
Abstract
Sign Languages are the primary form of communication for Deaf communities across the world. To break the communication barriers between the Deaf and Hard-of-Hearing and the hearing communities, it is imperative to build systems capable of translating spoken language into sign language and vice versa. Building on insights from previous research, we propose a deep learning model for Sign Language Production (SLP), which to our knowledge is the first attempt at Greek SLP. We tackle this task with a transformer-based architecture that enables translation from text input to human pose keypoints and vice versa. We evaluate the effectiveness of the proposed pipeline on the Greek SL dataset Elementary23 through a series of comparative analyses and ablation studies. Our pipeline's components, which include data-driven gloss generation, training through video-to-text translation, and a scheduling algorithm that alternates between teacher forcing and auto-regressive decoding, appear to enhance the quality of the produced SL videos.
