Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation

Edward Fish, Richard Bowden

TL;DR

Geo-Sign tackles Sign Language Translation (SLT) by embedding skeletal motion in hyperbolic space to respect the hierarchical structure of sign kinematics. It projects ST-GCN skeletal features into the Poincaré ball with a learnable curvature $c$ and regularises a pre-trained mT5 translator via a geometry-aware contrastive loss, explored through global (pooled) and token-based alignment strategies. The approach yields significant gains on CSL-Daily (e.g., BLEU-4 and ROUGE-L improvements over pose baselines) and is competitive with RGB methods, while preserving signer privacy and inference-time efficiency. The work demonstrates that hyperbolic geometry provides a principled inductive bias for discriminating fine-grained hand articulations from broader body movements, with potential extensions to multiple sign languages and other skeleton-based tasks.
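The projection step can be illustrated with the exponential map at the origin of the Poincaré ball. The following is a minimal NumPy sketch, not the authors' implementation: the function name `expmap0`, the fixed value of `c`, and the toy feature vectors are assumptions made for illustration, and in Geo-Sign the curvature is a learnable parameter rather than a constant.

```python
import numpy as np

def expmap0(v, c, eps=1e-9):
    """Exponential map at the origin of the Poincare ball of curvature -c (c > 0).

    Maps Euclidean (tangent) vectors v into the open ball of radius 1/sqrt(c).
    """
    sqrt_c = np.sqrt(c)
    # Guard against division by zero for the zero vector.
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

# Toy stand-ins for ST-GCN part features (hypothetical values).
features = np.array([[0.3, -0.4], [3.0, 4.0]])
points = expmap0(features, c=1.0)
radii = np.linalg.norm(points, axis=-1)
print(radii)  # every radius is strictly below 1/sqrt(c)
```

The map preserves direction and squashes the norm through `tanh`, so large-magnitude features land near the boundary of the ball, where hyperbolic distances grow rapidly.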

Abstract

Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincaré ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fréchet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
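To make the idea of a contrastive loss "operating directly in hyperbolic space" concrete, the sketch below uses negative geodesic distance as the similarity inside a generic InfoNCE objective. This is an assumption-laden illustration: the paper's exact loss (margins, weighting, temperature handling) is not given in this summary, and the names `poincare_dist` and `hyperbolic_info_nce`, the temperature value, and the toy embeddings are all hypothetical.

```python
import numpy as np

def poincare_dist(x, y, c):
    """Pairwise geodesic distances between rows of x and rows of y
    in the Poincare ball of curvature -c."""
    diff2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    x2 = np.sum(x ** 2, axis=-1)[:, None]
    y2 = np.sum(y ** 2, axis=-1)[None, :]
    arg = 1.0 + 2.0 * c * diff2 / ((1.0 - c * x2) * (1.0 - c * y2))
    # Clamp to the domain of arccosh to absorb rounding error.
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)

def hyperbolic_info_nce(pose, text, c=1.0, temperature=0.1):
    """Generic InfoNCE where row i of pose is the positive for row i of text."""
    logits = -poincare_dist(pose, text, c) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy embeddings already inside the unit ball (hypothetical values).
pose = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0]])
text_aligned = pose * 0.9
text_shuffled = text_aligned[[1, 2, 0]]
loss_aligned = hyperbolic_info_nce(pose, text_aligned)
loss_shuffled = hyperbolic_info_nce(pose, text_shuffled)
print(loss_aligned, loss_shuffled)
```

Minimising such a loss pulls each pose embedding towards its paired text embedding along geodesics while pushing it away from the other text embeddings in the batch.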

Paper Structure

This paper contains 48 sections, 1 theorem, 16 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Proposition D.1

The Poincaré ball $\mathbb B_c^{d}$ is a Hadamard manifold, hence $\mathcal{F}$ is strictly convex and has a unique minimiser $\mu^\star$. Let $L$ be the Lipschitz constant of $\nabla\mathcal{F}$ on the geodesic convex hull of $\{x_i\}$. If $0<\eta_k\le 2/L$ for all $k$, the iterates eq:supp_frechet
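The kind of iteration this proposition refers to can be sketched as Riemannian gradient descent on the weighted Fréchet functional $\tfrac12\sum_i w_i\, d_c(\mu, x_i)^2$. The code below is a minimal illustration under stated assumptions, not the paper's algorithm: it uses a fixed step size (`step=0.5`) instead of a schedule $\eta_k$, initialises at the first point, follows the common convention in which the conformal factor is $2/(1-c\lVert x\rVert^2)$, and all function names are hypothetical.

```python
import numpy as np

EPS = 1e-12

def mobius_add(x, y, c):
    """Mobius addition in the Poincare ball (broadcasts over leading axes)."""
    xy = np.sum(x * y, axis=-1, keepdims=True)
    x2 = np.sum(x * x, axis=-1, keepdims=True)
    y2 = np.sum(y * y, axis=-1, keepdims=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def logmap(x, y, c):
    """Log map at a single point x, applied to one or many points y."""
    sqrt_c = np.sqrt(c)
    lam = 2.0 / (1.0 - c * np.sum(x * x))
    u = mobius_add(-x, y, c)
    norm = np.maximum(np.linalg.norm(u, axis=-1, keepdims=True), EPS)
    scale = np.arctanh(np.minimum(sqrt_c * norm, 1 - 1e-12))
    return (2.0 / (sqrt_c * lam)) * scale * u / norm

def expmap(x, v, c):
    """Exponential map at a single point x for a single tangent vector v."""
    sqrt_c = np.sqrt(c)
    lam = 2.0 / (1.0 - c * np.sum(x * x))
    norm = max(np.linalg.norm(v), EPS)
    return mobius_add(x, np.tanh(sqrt_c * lam * norm / 2.0) * v / (sqrt_c * norm), c)

def frechet_mean(points, weights, c=1.0, step=0.5, iters=200, tol=1e-10):
    """Weighted Frechet mean by fixed-step Riemannian gradient descent."""
    w = weights / weights.sum()
    mu = points[0].copy()
    for _ in range(iters):
        # Negative Riemannian gradient of 0.5 * sum_i w_i d(mu, x_i)^2.
        descent = np.sum(w[:, None] * logmap(mu, points, c), axis=0)
        if np.linalg.norm(descent) < tol:
            break
        mu = expmap(mu, step * descent, c)
    return mu

# With two equally weighted points the mean is the geodesic midpoint.
points = np.array([[0.0, 0.0], [0.6, 0.0]])
mu = frechet_mean(points, np.array([1.0, 1.0]))
print(mu)
```

For this two-point example the midpoint can be checked by hand: the distance from the origin to $(0.6, 0)$ is $2\,\mathrm{artanh}(0.6)$, so the midpoint sits at Euclidean radius $\tanh(\mathrm{artanh}(0.6)/2) = 1/3$.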

Figures (5)

  • Figure 1: Geo-Sign’s hyperbolic framework. (Left) Skeletal features from ST-GCNs for different body parts are projected into a Poincaré ball whose curvature is learned, while the original branch fuses the features for processing by the mT5 language model. (Pooled) The pose features are aggregated via the weighted Fréchet mean algorithm, while the text embeddings from the final layer of the mT5 model are pooled and projected to the hyperbolic manifold. The geodesic distance between the text embedding and the mean pose feature is minimised for positive samples using the hyperbolic contrastive loss. (Token) Alternatively, hyperbolic pose features are used as attention queries against all text embeddings to generate a pose-contextual text embedding. Note the movement of the text features $c_{pi}$ (grey) towards the pose feature (blue). (Right) A representation of the Poincaré disk showing the difference between the Token and Pooled methods in the tangent space.
  • Figure 2: UMAP projection of pose part summary embeddings ($\bar{\mathbf{f}}_p$) onto the 2D Poincaré disk. (Left) Embeddings from the Euclidean Token regularisation model ($c=0.001$). (Right) Embeddings from the Geo-Sign (Hyperbolic Token) model. The hyperbolic embeddings show a more structured distribution, with hand features (representing finer details) often pushed towards the periphery, which is indicative of a learned kinematic hierarchy.
  • Figure 3: Evolution of the learnable manifold curvature $c$ during training for different initialisations. (a) When initialised at $c=1.50$, the curvature magnitude decreases slightly, suggesting an optimal value around $1.42$ for this setup. (b) When initialised at a low $c=0.10$, the curvature increases, indicating the model benefits from more "hyperbolic space" initially. It stabilises around $c=0.20$, potentially influenced by the dynamic $\alpha$ schedule that reduces the regularisation emphasis over time.
  • Figure 4: Plot of the geodesic distances from the origin ($\mathbf{0}$) of the Poincaré disk to the hyperbolic pose embeddings ($\mathbf{h}_p$) during training, averaged per part type. This shows how features for different parts utilize the hyperbolic space. For instance, right hand features (often conveying detailed lexical information) tend to move further from the origin, leveraging more of the hyperbolic curvature for discriminability. Body and face features, which might represent broader semantics or prosody, may remain closer to the Euclidean-like central region.
  • Figure 5: PCA projection of 1000 hyperbolic pose part embeddings (log-mapped to the tangent space at origin, then PCA-reduced to 2D) visualised within the Poincaré disk. Body features (blue) are tightly clustered near the origin, suggesting their discriminability is well-handled in a more Euclidean-like region. Hand features (left: red square, right: pink diamond) and face features (light blue triangle) are more dispersed, with hand features often pushed towards the periphery. This indicates these parts benefit from the increased representational capacity near the boundary of the Poincaré disk, where hyperbolic geometry provides more "space" to distinguish subtle variations crucial for sign language semantics.
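The quantities plotted in Figures 4 and 5 can be reproduced in outline as follows. This is a hedged sketch on synthetic data: the random embeddings, the fixed `c=1.0`, the SVD-based PCA, and the function names `logmap0`, `dist_from_origin`, and `pca_2d` are all assumptions for illustration, and the distance formula assumes the common convention $d_c(\mathbf{0}, \mathbf{x}) = \tfrac{2}{\sqrt{c}}\,\mathrm{artanh}(\sqrt{c}\,\lVert\mathbf{x}\rVert)$.

```python
import numpy as np

def logmap0(x, c, eps=1e-9):
    """Log map at the origin: Poincare-ball points -> tangent vectors."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)
    return np.arctanh(np.minimum(sqrt_c * norm, 1 - 1e-7)) * x / (sqrt_c * norm)

def dist_from_origin(x, c):
    """Geodesic distance from the origin to each row of x (as in Figure 4)."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(x, axis=-1)
    return 2.0 / sqrt_c * np.arctanh(np.minimum(sqrt_c * norm, 1 - 1e-7))

def pca_2d(tangent):
    """Project centred tangent vectors onto their top two principal axes."""
    centred = tangent - tangent.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

# Synthetic stand-ins for hyperbolic pose embeddings, all with norm < 0.9.
rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 8))
emb = 0.9 * raw / (1.0 + np.linalg.norm(raw, axis=1, keepdims=True))

tangent = logmap0(emb, c=1.0)
coords = pca_2d(tangent)          # 2D coordinates for plotting (as in Figure 5)
dists = dist_from_origin(emb, 1.0)
print(coords.shape, dists.mean())
```

Under this convention the distance from the origin is exactly twice the Euclidean norm of the log-mapped vector, so averaging `dists` per body part gives the kind of curve shown in Figure 4.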

Theorems & Definitions (1)

  • Proposition D.1: Convergence in $\mathbb B_c^{d}$