SkelCap: Automated Generation of Descriptive Text from Skeleton Keypoint Sequences

Ali Emre Keskin; Hacer Yalim Keles

TL;DR

This work tackles data scarcity in sign language research by pairing Turkish signs with textual descriptions through skeleton-keypoint representations. It introduces the AUTSL-SkelCap dataset and a baseline model, SkelCap, that maps 2D skeleton keypoints to spoken Turkish text via a transformer-based seq2seq architecture initialized from mT5-base. Evaluations show strong signer-agnostic performance but limited generalization to unseen signs, which exposes a gap in zero-shot description generation and motivates further research. The dataset and baseline provide a foundation for scalable sign-language production from motion cues and for broader research into text generation from skeletal data.

Abstract

Numerous sign language datasets exist, yet they typically cover only a limited selection of the thousands of signs used globally. Moreover, creating diverse sign language datasets is expensive and challenging because of the cost of gathering a varied group of signers. Motivated by these limitations, we focused on textually describing body movements from skeleton keypoint sequences, which led to the creation of a new dataset. We structured this dataset around AUTSL, a comprehensive isolated Turkish sign language dataset. We also developed a baseline model, SkelCap, which generates textual descriptions of body movements. This model processes the skeleton keypoint data as a vector, applies a fully connected layer for embedding, and uses a transformer neural network for sequence-to-sequence modeling. We conducted extensive evaluations of our model, including signer-agnostic and sign-agnostic assessments. The model achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation. The dataset we have prepared, AUTSL-SkelCap, will be made publicly available soon.
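
The input stage described in the abstract (each frame's keypoints treated as a vector, then passed through a fully connected layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the keypoint count, the embedding width, the random weight initialization, and all function names below are hypothetical, and a real model would use a deep learning framework and feed the embedded frames to a transformer encoder.

```python
import random

NUM_KEYPOINTS = 4  # hypothetical; the actual keypoint count is not stated here
EMBED_DIM = 8      # hypothetical embedding width


def flatten_frame(frame):
    """Turn one frame of (x, y) keypoints into a flat vector [x0, y0, x1, y1, ...]."""
    return [coord for point in frame for coord in point]


def make_linear(in_dim, out_dim, seed=0):
    """Create toy weights and biases for a fully connected layer (illustrative init)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(vector, weights, bias):
    """Apply y = W x + b to a single input vector."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def embed_sequence(frames, weights, bias):
    """Embed every frame independently, yielding one vector per time step."""
    return [linear(flatten_frame(frame), weights, bias) for frame in frames]


# Toy sequence: 5 frames, each with NUM_KEYPOINTS 2D keypoints.
frames = [[(0.1 * t, 0.2 * k) for k in range(NUM_KEYPOINTS)] for t in range(5)]
weights, bias = make_linear(2 * NUM_KEYPOINTS, EMBED_DIM)
embedded = embed_sequence(frames, weights, bias)
print(len(embedded), len(embedded[0]))  # 5 8
```

The output is one embedding per frame, which is the sequence form a sequence-to-sequence transformer would consume in place of token embeddings.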

Paper Structure (16 sections, 5 figures, 7 tables)

This paper contains 16 sections, 5 figures, 7 tables.

Introduction
Related Work
Action Recognition Datasets
Sign Language Recognition and Translation
Materials and Methods
Dataset Preparation
Isolated Sign Language Video Dataset
Mapping between Signs and Textual Descriptions
Manual Annotation of the Hand Shapes and the Alternative Pronunciations
Mapping between Videos and Skeleton Sequences
Skeleton Normalization
Structured Storing of the Skeleton Sequences
SkelCap: Proposed Architecture
Training
Experimental Results
...and 1 more sections

Figures (5)

Figure 1: A representative frame sequence from the AUTSL dataset for "key" ("anahtar" in Turkish). Top: selected video frames; bottom: skeleton keypoint sequence.
Figure 2: Mapping between textual descriptions and skeleton sequences.
Figure 3: Sample description of the 'key' sign from the Turkish Sign Language Dictionary (TSLD): 'The right hand is at chest level, shaped like a fist, with the index finger protruding forward and adjacent to the thumb (T hand). The right hand then rotates from the wrist to the right and left twice.' This image, adapted from a representative frame of the AUTSL dataset, illustrates how TSLD describes each sign.
Figure 4: Distribution of normalized skeleton keypoint coordinates along the x and y axes.
Figure 5: Our method for converting a sign skeleton sequence to text.
