LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

Abstract

Skeleton-based isolated sign language recognition (ISLR) demands fine-grained understanding of articulated motion across multiple spatial scales, from subtle finger movements to global body dynamics. Existing approaches typically rely on deep feed-forward architectures, which increase model capacity but lack mechanisms for recurrent refinement and structured representation. We propose LA-Sign, a looped transformer framework with geometry-aware alignment for ISLR. Instead of stacking deeper layers, LA-Sign derives its depth from recurrence, repeatedly revisiting latent representations to progressively refine motion understanding under shared parameters. To further regularise this refinement process, we present a geometry-aware contrastive objective that projects skeletal and textual features into an adaptive hyperbolic space, encouraging multi-scale semantic organisation. We study three looping designs and multiple geometric manifolds, demonstrating that encoder-decoder looping combined with adaptive Poincaré alignment yields the strongest performance. Extensive experiments on WLASL and MSASL benchmarks show that LA-Sign achieves state-of-the-art results while using fewer unique layers, highlighting the effectiveness of recurrent latent refinement and geometry-aware representation learning for sign language recognition.

Paper Structure

This paper contains 36 sections, 18 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of LA-Sign. Motion sequences are first processed by a part-wise ST-GCN encoder to extract sign features, which are then fed into a looped transformer for recurrent refinement. We study three looping variants: (a) encoder-decoder, (b) encoder-focused, and (c) decoder-focused, to assess how modality interaction patterns affect refinement. Geometry-aware (GA) alignment regularises the latent space using a hyperbolic contrastive objective. Sign features are aggregated via the Fréchet mean, while text features are pooled and projected onto the hyperbolic manifold. The geodesic distance between text embeddings and mean sign features is minimised for positive pairs using $\mathcal{L}_{GA}$ (see Section \ref{sec:GA}). The refined representation is finally projected to text tokens for recognition.
  • Figure 2: Architectural details of the three recurrent looping variants. (a) Encoder-decoder looping: the initial sign representation $S$ is concatenated with the previous cross-modal state $H^{s2t}_{i-1}$ before passing through the shared encoder-decoder block. (b) Encoder-focused looping: the visual representation $H^{s}_i$ is iteratively refined via residual updates with $S$. (c) Decoder-focused looping: the encoder processes $S$ into a fixed visual representation $H^s$, which the looped decoder repeatedly references to refine the cross-modal interpretation $H^{s2t}_i$. In all variants, $T$ provides the linguistic context or decoder prefix. Features highlighted in green and blue denote intermediate and final representations, respectively; both are utilised in our GA alignment.
  • Figure 3: UMAP visualisations of learned embeddings. Compared to the Euclidean baseline (a), the hyperbolic space (b) exhibits clearer radial organisation and improved semantic separation.
  • Figure 4: Representation refinement across loop iterations. We visualise the embedding distributions of three sign classes (“before”, “chair”, and “go”) across successive loop iterations. At early iterations (i = 1), the embeddings exhibit substantial overlap, indicating weak semantic separation. As the number of loops increases, the representations progressively form clearer cluster structures, with reduced inter-class overlap. This behaviour suggests that iterative looping refines the latent representation and enhances semantic organisation in the embedding space.
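The three looping variants described in Figure 2 can be sketched as simple update rules. The block below is a minimal numpy illustration under stated assumptions, not the paper's implementation: `shared` is a stand-in for the shared transformer block, the text context $T$ is omitted, and all names, dimensions, and initialisations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # toy feature dimension (hypothetical)
S = rng.normal(size=d)                        # initial sign features from the ST-GCN encoder
W_cat = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)
W_res = rng.normal(size=(d, d)) / np.sqrt(d)

def shared(x, W):
    """Stand-in for the shared (weight-tied) transformer block."""
    return np.tanh(x @ W)

def encoder_decoder_loop(S, n_loops=4):
    """(a) Concatenate S with the previous cross-modal state H^{s2t}_{i-1},
    then pass through the same encoder-decoder block each iteration."""
    h = np.zeros(d)
    for _ in range(n_loops):
        h = shared(np.concatenate([S, h]), W_cat)
    return h

def encoder_focused_loop(S, n_loops=4):
    """(b) Residual refinement of the visual representation H^s_i,
    re-injecting S at every step."""
    h = S.copy()
    for _ in range(n_loops):
        h = h + shared(h + S, W_res)
    return h

def decoder_focused_loop(S, n_loops=4):
    """(c) Encode S once into a fixed H^s, then repeatedly refine the
    cross-modal state H^{s2t}_i against it."""
    h_s = shared(S, W_res)                    # fixed visual representation H^s
    h = np.zeros(d)
    for _ in range(n_loops):
        h = shared(np.concatenate([h_s, h]), W_cat)
    return h
```

Because the block weights are tied across iterations, adding loops refines the representation without adding unique parameters, which is the trade-off the three variants explore.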
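The GA alignment in Figure 1 minimises geodesic distance between text embeddings and mean sign features on a hyperbolic manifold. As a point of reference, the standard geodesic distance on the unit Poincaré ball, together with a toy InfoNCE-style contrastive term over those distances, can be sketched as below. This is a generic illustration: the paper's adaptive (learned-curvature) formulation, the Fréchet-mean aggregation, and the exact form of $\mathcal{L}_{GA}$ may differ, and `ga_contrastive_loss`, `tau`, and `pos_idx` are hypothetical names.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between points inside the unit Poincare ball:
    d(x, y) = arccosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))."""
    num = 2.0 * np.sum((x - y) ** 2)
    den = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + num / (den + eps))

def ga_contrastive_loss(text_emb, sign_means, pos_idx, tau=1.0):
    """Toy contrastive term over geodesic distances (hypothetical form):
    pull the positive text/sign pair together, push the others apart."""
    dists = np.array([poincare_distance(text_emb, s) for s in sign_means])
    logits = -dists / tau                     # smaller distance -> larger logit
    return -(logits[pos_idx] - np.log(np.sum(np.exp(logits))))
```

A useful sanity check is the closed form $d(0, x) = 2\,\mathrm{artanh}(\lVert x \rVert)$ for distances from the origin, which the function above reproduces.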