
STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

Suvajit Patra, Soumitra Samanta

Abstract

Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates the resulting features into a local, context-aware spatio-temporal representation. The proposed encoder contains approximately $70\%$ to $80\%$ fewer parameters than existing state-of-the-art models while achieving performance comparable to keypoint-based methods on the Phoenix-14T dataset.

Paper Structure

This paper contains 11 sections, 8 equations, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Overview of the proposed STARK model architecture. A sign video is represented as a keypoint tensor $X_{\text{input}} \in \mathbb{R}^{d \times T \times P}$ containing the $x$ and $y$ coordinates and confidence scores for $P$ joints over $T$ frames. The input is projected with a linear layer, followed by the addition of positional encoding. Stacked STARK blocks use spatio-temporal attention to jointly model temporal relations between the same keypoint across neighboring frames and spatial relations between different keypoints within each frame. The resulting features are aggregated with average pooling over keypoints and temporally downsampled with max pooling, producing a compact representation of size $D \times T/4$, which is then passed to the gloss decoder for gloss recognition.
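
The data flow in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `stark_like_block` and `pool_keypoints_and_time`, the single softmax over the union of spatial and temporal neighbors, the window size, the random stand-in for positional encoding, and the single non-overlapping max pool of width 4 are all assumptions made for illustration. The sketch also stores tensors as $(T, P, d)$ rather than the caption's $d \times T \times P$ layout.

```python
import numpy as np


def stark_like_block(x, w_q, w_k, w_v, window=1):
    """Hypothetical spatio-temporal attention over keypoint tokens.

    x: (T, P, D) features. Each token (t, p) attends to
      - every keypoint in the same frame t (spatial), and
      - the same keypoint p in frames within `window` of t (temporal).
    Returns the attended features (T, P, D) and the (T*P, T*P) weights.
    """
    T, P, D = x.shape
    tokens = x.reshape(T * P, D)
    t_idx = np.repeat(np.arange(T), P)  # frame index of each token
    p_idx = np.tile(np.arange(P), T)    # keypoint index of each token

    same_frame = t_idx[:, None] == t_idx[None, :]
    same_joint = p_idx[:, None] == p_idx[None, :]
    near_in_time = np.abs(t_idx[:, None] - t_idx[None, :]) <= window
    allowed = same_frame | (same_joint & near_in_time)

    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = (q @ k.T) / np.sqrt(D)
    scores = np.where(allowed, scores, -np.inf)   # mask disallowed pairs
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ v).reshape(T, P, D), weights


def pool_keypoints_and_time(features, stride=4):
    """Average over keypoints, then max-pool time in non-overlapping windows."""
    T, P, D = features.shape
    per_frame = features.mean(axis=1)             # (T, D)
    t_out = T // stride
    trimmed = per_frame[: t_out * stride]
    return trimmed.reshape(t_out, stride, D).max(axis=1)  # (T // stride, D)


rng = np.random.default_rng(0)
T, P, d, D = 16, 5, 3, 8                          # toy sizes, not from the paper
x_in = rng.normal(size=(T, P, d))                 # (x, y, confidence) per joint
w_in = rng.normal(size=(d, D))                    # linear input projection
pos = 0.1 * rng.normal(size=(T, 1, D))            # stand-in positional encoding
h = x_in @ w_in + pos

w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
h, weights = stark_like_block(h, w_q, w_k, w_v, window=1)
out = pool_keypoints_and_time(h)
print(out.shape)  # (4, 8), i.e. T/4 frames of D features
```

With `window=1`, an interior token attends to the $P$ keypoints of its own frame plus the same keypoint in the two adjacent frames, so each attention row has $P + 2$ nonzero weights. The dense $(TP) \times (TP)$ mask is used here only for readability; an efficient implementation would restrict the computation to the allowed neighbors.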