Geometry-Aware Metric Learning for Cross-Lingual Few-Shot Sign Language Recognition on Static Hand Keypoints

Chayanin Chamachot, Kanokphan Lertniponphan

TL;DR

Invariant hand-geometry descriptors provide a portable and effective foundation for cross-lingual few-shot SLR in low-resource settings, and they enable frozen cross-lingual transfer that frequently exceeds within-domain accuracy.

Abstract

Sign language recognition (SLR) systems typically require large labeled corpora for each language, yet the majority of the world's 300+ sign languages lack sufficient annotated data. Cross-lingual few-shot transfer (pretraining on a data-rich source language and adapting with only a handful of target-language examples) offers a scalable alternative, but conventional coordinate-based keypoint representations are susceptible to domain shift arising from differences in camera viewpoint, hand scale, and recording conditions. This shift is particularly detrimental in the few-shot regime, where class prototypes estimated from only K examples are highly sensitive to extrinsic variance. We propose a geometry-aware metric-learning framework centered on a compact 20-dimensional inter-joint angle descriptor derived from MediaPipe static hand keypoints. These angles are invariant to SO(3) rotation, translation, and isotropic scaling, eliminating the dominant sources of cross-dataset shift and yielding tighter, more stable class prototypes. Evaluated on four fingerspelling alphabets spanning typologically diverse sign languages (ASL, LIBRAS, Arabic Sign Language, and Thai Sign Language), the proposed angle features improve over normalized-coordinate baselines by up to 25 percentage points within-domain and enable frozen cross-lingual transfer that frequently exceeds within-domain accuracy, using a lightweight MLP encoder with about 10^5 parameters. These findings demonstrate that invariant hand-geometry descriptors provide a portable and effective foundation for cross-lingual few-shot SLR in low-resource settings.
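
The invariance claim in the abstract can be checked numerically. The following is a minimal sketch, not the authors' implementation: the exact composition of the 20-dimensional descriptor is not given in this section, so the hypothetical helper `joint_angles` computes only 15 interior angles (three per finger chain, using the MediaPipe keypoint ordering listed under Figure 2). The function names, the random test hand, and the chosen scale and offset are all assumptions made for illustration.

```python
import numpy as np

# MediaPipe Hands ordering: 0 = wrist, then four joints per finger.
FINGER_CHAINS = [
    [0, 1, 2, 3, 4],      # thumb
    [0, 5, 6, 7, 8],      # index
    [0, 9, 10, 11, 12],   # middle
    [0, 13, 14, 15, 16],  # ring
    [0, 17, 18, 19, 20],  # pinky
]


def joint_angles(keypoints):
    """Return 15 interior angles in radians, three per finger chain.

    Hypothetical angle set for illustration only; it is NOT the paper's
    20-D descriptor, whose definition is not given in this section.
    """
    pts = np.asarray(keypoints, dtype=float).reshape(21, 3)
    angles = []
    for chain in FINGER_CHAINS:
        # Consecutive triplets (a, b, c); the angle is measured at joint b.
        for a, b, c in zip(chain[:-2], chain[1:-1], chain[2:]):
            u = pts[a] - pts[b]
            v = pts[c] - pts[b]
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)


def random_rotation(rng):
    """Draw a random proper rotation (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # make the factorization sign-consistent
    if np.linalg.det(q) < 0:      # flip one axis to land in SO(3)
        q[:, 0] = -q[:, 0]
    return q


rng = np.random.default_rng(0)
hand = rng.normal(size=(21, 3))   # synthetic stand-in for real keypoints
R = random_rotation(rng)

# Apply rotation, isotropic scaling (2.5x), and translation.
transformed = 2.5 * hand @ R.T + np.array([0.3, -1.2, 4.0])

same = np.allclose(joint_angles(hand), joint_angles(transformed))
print("angles unchanged:", same)
```

Because each angle depends only on the normalized dot product of two difference vectors, translation cancels in the differences, scaling cancels in the normalization, and rotation preserves dot products. Raw coordinates offer none of these guarantees without explicit normalization.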

Paper Structure

This paper contains 45 sections, 5 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Pipeline overview. A hand image is processed by MediaPipe Hands into 21 keypoints (63-D when flattened) and converted to one of three representations: raw (63-D), angle (20-D), or raw_angle (83-D). The chosen representation is encoded into a 128-D embedding for Prototypical Network classification. In the cross-lingual setting, the encoder is either frozen or undergoes target-supervised adaptation (last-layer fine-tuning on the target train split).
  • Figure 2: MediaPipe Hands keypoint topology ($i \in \{0,\ldots,20\}$). Keypoint 0 is the wrist (root); each finger forms a four-joint kinematic chain following the official ordering [zhang2020mediapipe]: Thumb (1-4), Index (5-8), Middle (9-12), Ring (13-16), Pinky (17-20).
  • Figure 3: Error analysis on Thai (MLP / angle / 5-way 5-shot, 600 episodes). Left: Per-class accuracy (26-94%). Right: Top confused pairs.
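
The Prototypical Network classification step named in Figure 1, in the 5-way 5-shot episode format of Figure 3, can be sketched as follows. This is an illustrative outline under stated assumptions rather than the paper's code: the embeddings are synthetic Gaussian clusters standing in for encoder outputs, squared Euclidean distance is assumed as the metric (the standard Prototypical Network choice, not confirmed by this section), and the 15 queries per class and the helper names `prototypes` and `classify` are hypothetical.

```python
import numpy as np


def prototypes(support_emb, support_labels, n_way):
    """Class prototype = mean of that class's support embeddings."""
    return np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in range(n_way)]
    )


def classify(query_emb, protos):
    """Assign each query to its nearest prototype.

    Squared Euclidean distance is an assumption of this sketch.
    """
    diff = query_emb[:, None, :] - protos[None, :, :]   # (Q, n_way, dim)
    d2 = (diff ** 2).sum(axis=-1)                       # (Q, n_way)
    return d2.argmin(axis=1)


# Synthetic 5-way 5-shot episode with 128-D embeddings.
# n_query = 15 per class is an assumed value, not taken from the paper.
rng = np.random.default_rng(0)
n_way, k_shot, n_query, dim = 5, 5, 15, 128

centers = 3.0 * rng.normal(size=(n_way, dim))           # fake class centers
support_labels = np.repeat(np.arange(n_way), k_shot)
query_labels = np.repeat(np.arange(n_way), n_query)
support_emb = centers[support_labels] + rng.normal(size=(n_way * k_shot, dim))
query_emb = centers[query_labels] + rng.normal(size=(n_way * n_query, dim))

protos = prototypes(support_emb, support_labels, n_way)
pred = classify(query_emb, protos)
accuracy = (pred == query_labels).mean()
print(f"episode accuracy: {accuracy:.3f}")
```

With only K = 5 support examples per class, each prototype is a mean over five points, so any extrinsic variance in the input features (viewpoint, scale) carries directly into the prototype estimate. In a full evaluation, accuracy would be averaged over many such episodes (600 in Figure 3).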