Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Sign Language and Fingerspelling Recognition
Koki Hirooka, Abu Saleh Musa Miah, Tatsuya Murakami, Md. Al Mehedi Hasan, Yong Seok Hwang, Jungpil Shin
TL;DR
This work tackles the challenge of sign language recognition with skeleton data by moving beyond fixed graphs. It introduces the Sequential Spatio-Temporal Attention Network (SSTAN), a post-normalization, stacked Transformer architecture that sequentially applies Spatial Multi-Head Attention (MHA) and Temporal MHA to model intra-frame and inter-frame dynamics without predefined skeleton graphs. The method achieves state-of-the-art (SOTA) results for Japanese and Korean fingerspelling and establishes a new skeleton-only SOTA on WLASL when trained from scratch, underscoring its data efficiency. The findings suggest potential for further gains through self-supervised learning (SSL) pre-training and multi-modal (RGB) fusion, as well as integration with large vision-language models.
Abstract
Hand gesture-based Sign Language Recognition (SLR) serves as a crucial communication bridge between deaf and non-deaf individuals. While Graph Convolutional Networks (GCNs) are common, they are limited by their reliance on fixed skeletal graphs. To overcome this, we propose the Sequential Spatio-Temporal Attention Network (SSTAN), a novel Transformer-based architecture. Our model employs a hierarchical, stacked design that sequentially integrates Spatial Multi-Head Attention (MHA) to capture intra-frame joint relationships and Temporal MHA to model long-range inter-frame dependencies. This approach allows the model to efficiently learn complex spatio-temporal patterns without predefined graph structures. We validated our model through extensive experiments on diverse, large-scale datasets (WLASL, JSL, and KSL). A key finding is that our model, trained entirely from scratch, achieves state-of-the-art (SOTA) performance in the challenging fingerspelling categories (JSL and KSL). Furthermore, it establishes a new SOTA for skeleton-only methods on WLASL, outperforming several approaches that rely on complex self-supervised pre-training. These results demonstrate our model's high data efficiency and its effectiveness in capturing the intricate dynamics of sign language. The official implementation is available at our GitHub repository: https://github.com/K-Hirooka-Aizu/skeleton-slr-transformer
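The sequential design described in the abstract (Spatial MHA within each frame, followed by Temporal MHA across frames, with post-normalization) can be illustrated with a minimal NumPy sketch. This is not the official implementation; the tensor layout (frames, joints, features), the random projection weights, the sizes (16 frames, 21 joints, 32 features, 4 heads), the two-block stack, and the function names `multi_head_attention` and `spatial_temporal_block` are all assumptions made for illustration. Components a full model would likely include, such as feed-forward sublayers, positional encodings, and a classifier head, are omitted.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis (no learned scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, dim):
    # Hypothetical random projections standing in for learned weights.
    return {n: rng.normal(scale=dim ** -0.5, size=(dim, dim))
            for n in ("wq", "wk", "wv", "wo")}


def multi_head_attention(x, params, num_heads):
    """Self-attention over the sequence axis of x with shape (batch, seq, dim)."""
    b, s, d = x.shape
    head_dim = d // num_heads

    def split(t):  # (b, s, d) -> (b, heads, s, head_dim)
        return t.reshape(b, s, num_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = (split(x @ params[n]) for n in ("wq", "wk", "wv"))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)           # (b, heads, s, s)
    out = (weights @ v).transpose(0, 2, 1, 3).reshape(b, s, d)
    return out @ params["wo"], weights


def spatial_temporal_block(x, spatial_params, temporal_params, num_heads):
    """x has shape (frames, joints, dim)."""
    # Spatial MHA: each frame is a batch item; attention runs across joints,
    # so joint-to-joint relations are learned rather than fixed by a graph.
    attn, _ = multi_head_attention(x, spatial_params, num_heads)
    x = layer_norm(x + attn)                     # post-normalization

    # Temporal MHA: each joint is a batch item; attention runs across frames.
    xt = x.transpose(1, 0, 2)                    # (joints, frames, dim)
    attn, _ = multi_head_attention(xt, temporal_params, num_heads)
    xt = layer_norm(xt + attn)                   # post-normalization
    return xt.transpose(1, 0, 2)                 # back to (frames, joints, dim)


rng = np.random.default_rng(0)
T, J, D, H = 16, 21, 32, 4                       # assumed sizes
x = rng.normal(size=(T, J, D))

# Stack two blocks (the depth is an arbitrary choice for this sketch).
blocks = [(init_params(rng, D), init_params(rng, D)) for _ in range(2)]
out = x
for spatial_params, temporal_params in blocks:
    out = spatial_temporal_block(out, spatial_params, temporal_params, H)

print(out.shape)
```

In this sketch the only difference between the spatial and temporal steps is which axis is treated as the sequence: transposing the frame and joint axes lets the same attention routine cover intra-frame and inter-frame dependencies in turn.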
