Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Sign Language and Fingerspelling Recognition

Koki Hirooka, Abu Saleh Musa Miah, Tatsuya Murakami, Md. Al Mehedi Hasan, Yong Seok Hwang, Jungpil Shin

TL;DR

This work tackles the challenge of sign language recognition with skeleton data by moving beyond fixed graphs. It introduces SSTAN, a post-normalization, stacked Transformer architecture that sequentially applies Spatial MHA and Temporal MHA to model intra-frame and inter-frame dynamics without predefined skeleton graphs. The method achieves state-of-the-art results for Japanese and Korean fingerspelling and establishes a new skeleton-only SOTA on WLASL when trained from scratch, underscoring its data efficiency. The findings suggest strong potential for further gains through SSL pre-training and multi-modal (RGB) fusion, as well as integration with large vision-language models.

Abstract

Hand gesture-based Sign Language Recognition (SLR) serves as a crucial communication bridge between deaf and non-deaf individuals. While Graph Convolutional Networks (GCNs) are common, they are limited by their reliance on fixed skeletal graphs. To overcome this, we propose the Sequential Spatio-Temporal Attention Network (SSTAN), a novel Transformer-based architecture. Our model employs a hierarchical, stacked design that sequentially integrates Spatial Multi-Head Attention (MHA) to capture intra-frame joint relationships and Temporal MHA to model long-range inter-frame dependencies. This approach allows the model to efficiently learn complex spatio-temporal patterns without predefined graph structures. We validated our model through extensive experiments on diverse, large-scale datasets (WLASL, JSL, and KSL). A key finding is that our model, trained entirely from scratch, achieves state-of-the-art (SOTA) performance in the challenging fingerspelling categories (JSL and KSL). Furthermore, it establishes a new SOTA for skeleton-only methods on WLASL, outperforming several approaches that rely on complex self-supervised pre-training. These results demonstrate our model's high data efficiency and its effectiveness in capturing the intricate dynamics of sign language. The official implementation is available at our GitHub repository: https://github.com/K-Hirooka-Aizu/skeleton-slr-transformer.
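
The sequential design described in the abstract (Spatial MHA over the joints within each frame, then Temporal MHA over the frames for each joint, with post-normalization) can be illustrated with a minimal NumPy sketch. This is not the official implementation. The tensor layout (frames, joints, channels), the sizes (16 frames, 21 joints, 32 channels, 4 heads, 2 stacked blocks), and the helper names `init_mha`, `mha`, and `spatial_temporal_block` are assumptions made for illustration. Feed-forward sublayers, positional encodings, learned normalization parameters, and the classification head are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel axis (no learned scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_mha(rng, dim):
    # Hypothetical helper: random query/key/value/output projection matrices.
    scale = 1.0 / np.sqrt(dim)
    return {name: rng.normal(0.0, scale, size=(dim, dim)) for name in ("q", "k", "v", "o")}

def mha(x, params, num_heads):
    """Multi-head self-attention over axis 1 of x with shape (batch, seq, dim)."""
    b, n, d = x.shape
    head_dim = d // num_heads

    def split(t):
        # (batch, seq, dim) -> (batch, heads, seq, head_dim)
        return t.reshape(b, n, num_heads, head_dim).transpose(0, 2, 1, 3)

    q = split(x @ params["q"])
    k = split(x @ params["k"])
    v = split(x @ params["v"])
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)  # (batch, heads, seq, seq)
    out = softmax(scores) @ v                                  # (batch, heads, seq, head_dim)
    out = out.transpose(0, 2, 1, 3).reshape(b, n, d)           # merge heads
    return out @ params["o"]

def spatial_temporal_block(x, spatial_params, temporal_params, num_heads=4):
    """x has shape (T, V, C): frames, joints, channels (assumed layout)."""
    # Spatial MHA: each frame is a batch item, attention runs across its joints.
    # Post-normalization: LayerNorm is applied after the residual addition.
    x = layer_norm(x + mha(x, spatial_params, num_heads))
    # Temporal MHA: each joint is a batch item, attention runs across frames.
    xt = x.transpose(1, 0, 2)  # (V, T, C)
    xt = layer_norm(xt + mha(xt, temporal_params, num_heads))
    return xt.transpose(1, 0, 2)  # back to (T, V, C)

rng = np.random.default_rng(0)
T, V, C = 16, 21, 32                      # assumed sizes for illustration
x = rng.normal(size=(T, V, C))            # stand-in for embedded skeleton keypoints
blocks = [(init_mha(rng, C), init_mha(rng, C)) for _ in range(2)]

out = x
for spatial_params, temporal_params in blocks:  # stacked blocks applied in sequence
    out = spatial_temporal_block(out, spatial_params, temporal_params)
print(out.shape)  # (16, 21, 32)
```

Because attention weights are computed between every pair of joints (and every pair of frames), no adjacency matrix or predefined skeleton graph appears anywhere in the block, which is the property the abstract contrasts with GCN-based methods.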

Paper Structure

This paper contains 25 sections, 9 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Skeleton graph for fingerspelling.
  • Figure 2: Overview of the proposed architecture.
  • Figure 3: Pose and graph construction for WLASL.
  • Figure 4: Internal structure of the Spatial-Temporal Transformer block.
  • Figure 5: Multi-head self-attention along the spatial dimension.
  • ...and 1 more figure