The NGT200 Dataset: Geometric Multi-View Isolated Sign Recognition

Oline Ranum, David R. Wessels, Gomer Otterspeer, Erik J. Bekkers, Floris Roelofsen, Jari I. Andersen

TL;DR

This work introduces the NGT200 dataset to study multi-view isolated sign recognition (MV-ISR) with explicit emphasis on geometric and 3D-aware representations. It demonstrates that MV-ISR is distinct from single-view ISR by showing view-angle sensitivity and the benefits of incorporating multiple views, synthetic data, and geometric inductive biases. Methodologically, the study benchmarks a Sign Language Graph Convolution Network (SL-GCN) on reduced pose graphs and then advances to an SE(2)-equivariant temporal-PONITA model, which achieves higher accuracy and stability. The findings suggest that multi-view pose-based approaches, augmented with synthetic data and geometry-informed models, offer a scalable and privacy-preserving pathway toward robust sign language recognition, with implications for real-world MV-SLR systems and future dataset expansions.

Abstract

Sign Language Processing (SLP) provides a foundation for a more inclusive future in language technology; however, the field faces several significant challenges that must be addressed to achieve practical, real-world applications. This work addresses multi-view isolated sign recognition (MV-ISR), and highlights the essential role of 3D awareness and geometry in SLP systems. We introduce the NGT200 dataset, a novel spatio-temporal multi-view benchmark, establishing MV-ISR as distinct from single-view ISR (SV-ISR). We demonstrate the benefits of synthetic data and propose conditioning sign representations on spatial symmetries inherent in sign language. Leveraging an SE(2) equivariant model improves MV-ISR performance by 8%-22% over the baseline.
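
To make the SE(2) symmetry mentioned in the abstract concrete, the following is a minimal sketch, not the authors' model: it applies an arbitrary planar rotation and translation to a handful of 2D landmarks and checks that pairwise distances are unchanged. The function names, the toy landmark coordinates, and the chosen angle and offset are all assumptions made for illustration; an equivariant network builds on this kind of symmetry rather than on this exact feature.

```python
import math

def se2_transform(points, theta, tx, ty):
    """Rotate 2D points by theta (radians), then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def pairwise_distances(points):
    """Return the N x N table of Euclidean distances between points."""
    return [[math.dist(p, q) for q in points] for p in points]

# Hypothetical 2D landmarks (illustrative values, not taken from NGT200).
landmarks = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0), (-1.5, 0.7)]

# Arbitrary SE(2) element: 30 degree rotation plus a small shift.
moved = se2_transform(landmarks, math.radians(30.0), 0.3, -0.1)

before = pairwise_distances(landmarks)
after = pairwise_distances(moved)

# Pairwise distances are invariant under rotations and translations.
distances_preserved = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(before, after)
    for a, b in zip(row_a, row_b)
)
print(distances_preserved)
```

The coordinates themselves change under the transformation while the distance table does not, which is the sense in which such features are insensitive to in-plane rotation and translation of the signer.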

Paper Structure (38 sections, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Configuration of video capture setup with the signCollect platform: each camera is positioned 4 meters away from the signer, with a 25$^\circ$ separation between cameras.
  • Figure 2: Spatio-temporal point clouds extracted with MediaPipe, displaying the front and right view. White landmarks represent a single frame, while blue landmarks indicate temporal dynamics across multiple frames. Dashed lines connect the landmarks purely for visual enhancement and do not reflect elements in the dataset.
  • Figure 3: The frequency of each handshape type within the NGT200 vocabulary, categorized by strong (dominant hand) and weak (non-dominant hand).
  • Figure 4: The reduced spatial graph used in our experiments. The graph reflects a simplified human skeleton using 27 nodes: 10 nodes per hand, and 7 nodes for the overall pose position. Spatial edges connect nodes to approximate the human bone structure.
  • Figure 5: Performance evaluation scheme using k-fold cross-validation across three distinct test sets. Accuracy scores (Acc) are computed for each fold within a test set, and the average accuracy (Avg) is calculated across all folds.
  • ...and 4 more figures
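
The evaluation scheme described in the Figure 5 caption (per-fold accuracy within each test set, then an average across folds) can be sketched as follows. This is a minimal illustration under stated assumptions: the test-set names, the toy predictions and labels, and the helper names `accuracy` and `evaluate` are hypothetical and do not come from the paper.

```python
from statistics import mean

def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def evaluate(fold_outputs):
    """Compute per-fold accuracy (Acc) and its average (Avg) per test set.

    fold_outputs maps a test-set name to a list with one
    (predictions, labels) pair per cross-validation fold.
    """
    report = {}
    for name, folds in fold_outputs.items():
        accs = [accuracy(preds, labels) for preds, labels in folds]
        report[name] = {"per_fold": accs, "avg": mean(accs)}
    return report

# Toy outputs for three hypothetical test sets with two folds each.
toy = {
    "test_set_a": [([1, 2, 3, 4], [1, 2, 3, 0]), ([0, 1, 1, 1], [0, 1, 1, 1])],
    "test_set_b": [([5, 5], [5, 6]), ([7, 8], [7, 8])],
    "test_set_c": [([2, 2, 2, 2], [2, 2, 0, 0]), ([3, 3, 3, 3], [3, 3, 3, 0])],
}

report = evaluate(toy)
for name, result in report.items():
    print(name, result["per_fold"], result["avg"])
```

Keeping the per-fold scores alongside the average makes it possible to report variability across folds as well as the mean.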