SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space

Sen Fang, Yalin Feng, Chunyu Sui, Hongbin Zhong, Hongwei Yi, Dimitris N. Metaxas

TL;DR

SignX tackles continuous sign language recognition by learning in a compact pose-rich latent space that unifies five pose representations. It introduces a ViT-based Video2Pose module to extract latent pose from raw video and a two-stage training regime (Pose2Gloss followed by Video2Pose) to align pose with gloss outputs, complemented by latent-space temporal modeling and regularization. The approach achieves state-of-the-art results across major CSLR benchmarks with substantial efficiency gains, including faster inference and lower power consumption. This latent-space paradigm offers a scalable, data-efficient path for robust, real-time sign language recognition.

Abstract

The complexity of sign language data processing brings many challenges. The current approach to recognition of ASL signs aims to translate RGB sign language videos through pose information into English-based ID Glosses, which serve to uniquely identify ASL signs. This paper proposes SignX, a novel framework for continuous sign language recognition in compact pose-rich latent space. First, we construct a unified latent representation that encodes heterogeneous pose formats (SMPLer-X, DWPose, Mediapipe, PrimeDepth, and Sapiens Segmentation) into a compact, information-dense space. Second, we train a ViT-based Video2Pose module to extract this latent representation directly from raw videos. Finally, we develop a temporal modeling and sequence refinement method that operates entirely in this latent space. This multi-stage design achieves end-to-end sign language recognition while significantly reducing computational consumption. Experimental results demonstrate that SignX achieves state-of-the-art accuracy on continuous sign language recognition.
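
The first step in the abstract, encoding five heterogeneous pose formats into one compact vector per frame, can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration: the per-source feature sizes, the latent size, and the fixed random linear map that stands in for a learned fusion layer. It is not the authors' implementation.

```python
import random

# Hypothetical per-frame feature sizes for the five pose sources named in the
# abstract. The real dimensions are not stated here; these are placeholders.
POSE_DIMS = {
    "smplerx": 12,
    "dwpose": 8,
    "mediapipe": 8,
    "primedepth": 6,
    "sapiens_seg": 6,
}
LATENT_DIM = 4  # assumed size of the compact latent vector


def make_projection(in_dim, out_dim, seed=0):
    """Build a fixed random linear map standing in for a learned fusion layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def fuse_frame(pose_features, projection):
    """Concatenate the five per-frame pose vectors and project to the latent space."""
    concatenated = []
    for name in POSE_DIMS:  # fixed order keeps projection columns aligned
        vec = pose_features[name]
        if len(vec) != POSE_DIMS[name]:
            raise ValueError(f"{name}: expected {POSE_DIMS[name]} values, got {len(vec)}")
        concatenated.extend(vec)
    return [sum(w * x for w, x in zip(row, concatenated)) for row in projection]


def fuse_clip(frames, projection):
    """Map a clip (list of per-frame feature dicts) to a list of latent vectors."""
    return [fuse_frame(frame, projection) for frame in frames]


total_dim = sum(POSE_DIMS.values())
projection = make_projection(total_dim, LATENT_DIM)
# Toy clip of three frames; frame t holds the constant value 0.1 * t everywhere.
clip = [{name: [0.1 * t] * dim for name, dim in POSE_DIMS.items()} for t in range(3)]
latents = fuse_clip(clip, projection)
print(len(latents), len(latents[0]))
```

The point of the sketch is only the shape of the computation: a 40-value concatenation per frame becomes a 4-value latent, and every later stage would consume the latent rather than the raw pose streams.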

Paper Structure

This paper contains 57 sections, 19 equations, 9 figures, 9 tables, 3 algorithms.

Figures (9)

  • Figure 1: Multimodal pose estimation methods: SMPLer-X [cai2023smplerx] provides accurate 3D human body parameters; DWPose [yang2023effective] focuses on real-time 2D keypoint detection; MediaPipe [MediaPipe] provides lightweight, efficient 3D pose prediction; PrimeDepth [zavadski2024primedepth] estimates scene depth; and Sapiens Segmentation [khirodkar2024sapiens] provides fine-grained human body part segmentation. Each method has its own strengths, and together they provide rich feature representations for sign language recognition.
  • Figure 2: Overview of constructing the compact pose-rich latent space. We use a ViT [dosovitskiy2020vit] to construct and host the pose latent space. It has two entry points: a video entry at the top layer and a pose-data entry in the middle section. (a) In training stage 1, we first train the pose fusion layer to output simple text information, ensuring that the learned pose representations are meaningful. (b) In training stage 2, we freeze all other components and learn only how to convert RGB videos into our pose features. At inference, only RGB is used as input, so the model must be able to encode RGB inputs into our pose features.
  • Figure 3: Organizing the latent space and performing continuous recognition in it. (a) We further distill, compress, and organize the latent space of the enriched poses. The number of trained features should align as closely as possible with the number of glosses, which further raises the upper limit of the model's performance. (b) We then build on a BiLSTM method [zhang2023sltunet] to perform continuous sign recognition in this latent space, achieving results far superior to previous work.
  • Figure 4: Ablation study results. Performance comparison across different 1k-step model configurations on the ASLLRP dev set. (a) Impact of full pipeline components. (b) Effect of pose feature integration. (c) Contribution of the sign language recognition module. (d) Improvement from latent refinement. All metrics indicate that each change was necessary and effective. More results are available in Sec. \ref{subsec:Training_Efficiency_of_Latent_Space} and Sec. \ref{subsec:Abl_of_Latent_Space}.
  • Figure 5: Optimization efficiency in constructing the pose-rich latent space. This figure tracks the convergence of four critical loss components designed to shape the latent space. The synchronized decline of Text and Word Match losses demonstrates the successful encoding of sign semantics into the unified latent representation.
  • ...and 4 more figures
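
The two-stage schedule described in the Figure 2 caption can be summarized as a small configuration sketch. The module names (`video_encoder`, `pose_fusion`, `text_head`) are hypothetical labels chosen for illustration. Treating the video branch as unused, and therefore not trained, in stage 1 and training the text head alongside the fusion layer are assumptions, since the caption only states that the pose fusion layer is trained first and that everything except the video-to-pose path is frozen afterwards.

```python
# Hypothetical module names; the caption does not name the components.
MODULES = ("video_encoder", "pose_fusion", "text_head")


def configure_stage(stage):
    """Return the entry point and per-module trainable flags for a training stage."""
    if stage == 1:
        # Stage 1: pose data enters at the middle entry point; the fusion layer
        # learns to emit simple text. Assumption: the text head trains with it.
        entry = "pose"
        trainable = {"video_encoder": False, "pose_fusion": True, "text_head": True}
    elif stage == 2:
        # Stage 2: everything else is frozen; only the video-to-pose path learns.
        entry = "video"
        trainable = {"video_encoder": True, "pose_fusion": False, "text_head": False}
    else:
        raise ValueError(f"unknown stage: {stage}")
    return {"entry": entry, "trainable": trainable}


def inference_config():
    """At inference only RGB video is available, and no module is updated."""
    return {"entry": "video", "trainable": {name: False for name in MODULES}}


for stage in (1, 2):
    cfg = configure_stage(stage)
    active = [name for name, flag in cfg["trainable"].items() if flag]
    print(f"stage {stage}: entry={cfg['entry']}, trainable={active}")
```

In a deep-learning framework these flags would typically translate into excluding the frozen modules' parameters from gradient updates. The sketch stays framework-free so that it only captures which parts move in which stage.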