SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space
Sen Fang, Yalin Feng, Chunyu Sui, Hongbin Zhong, Hongwei Yi, Dimitris N. Metaxas
TL;DR
SignX tackles continuous sign language recognition (CSLR) by learning in a compact, pose-rich latent space that unifies five pose representations. It introduces a ViT-based Video2Pose module to extract latent pose from raw video and a two-stage training regime (Pose2Gloss followed by Video2Pose) to align pose with gloss outputs, complemented by latent-space temporal modeling and regularization. The approach achieves state-of-the-art results across major CSLR benchmarks with substantial efficiency gains, including faster inference and lower power consumption. This latent-space paradigm offers a scalable, data-efficient path toward robust, real-time sign language recognition.
Abstract
The complexity of sign language data processing brings many challenges. The current approach to recognizing ASL signs translates RGB sign language videos, by way of pose information, into English-based ID glosses, which uniquely identify ASL signs. This paper proposes SignX, a novel framework for continuous sign language recognition in a compact, pose-rich latent space. First, we construct a unified latent representation that encodes heterogeneous pose formats (SMPLer-X, DWPose, MediaPipe, PrimeDepth, and Sapiens Segmentation) into a compact, information-dense space. Second, we train a ViT-based Video2Pose module to extract this latent representation directly from raw videos. Finally, we develop a temporal modeling and sequence refinement method that operates entirely in this latent space. This multi-stage design achieves end-to-end sign language recognition while significantly reducing computational cost. Experimental results demonstrate that SignX achieves state-of-the-art accuracy on continuous sign language recognition.
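As a rough illustration of the first step, the sketch below maps per-frame features from five pose formats into one shared low-dimensional vector. It is not the SignX implementation: the feature sizes, the latent size, the random linear projections (standing in for learned encoders), and the averaging fusion are all assumptions chosen to keep the example small, and every name in it is hypothetical.

```python
import random

# Hypothetical per-frame feature sizes for each pose format (not from the paper).
POSE_DIMS = {
    "smplerx": 12,
    "dwpose": 10,
    "mediapipe": 8,
    "primedepth": 6,
    "sapiens_seg": 6,
}
LATENT_DIM = 4  # assumed size of the compact latent vector


def make_projection(in_dim, out_dim, rng):
    """Random linear map standing in for a learned per-format encoder."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse_frame(features, projections):
    """Project each format into the latent space and average the results."""
    latents = [project(projections[name], features[name]) for name in POSE_DIMS]
    return [sum(vals) / len(latents) for vals in zip(*latents)]


def encode_sequence(frames, projections):
    """Map a list of per-frame feature dicts to a list of latent vectors."""
    return [fuse_frame(frame, projections) for frame in frames]


rng = random.Random(0)
projections = {
    name: make_projection(dim, LATENT_DIM, rng) for name, dim in POSE_DIMS.items()
}
# Three synthetic frames, each carrying features for all five formats.
frames = [
    {name: [rng.random() for _ in range(dim)] for name, dim in POSE_DIMS.items()}
    for _ in range(3)
]
latent_sequence = encode_sequence(frames, projections)
print(len(latent_sequence), len(latent_sequence[0]))  # 3 4
```

In this toy setup, a downstream temporal model would consume `latent_sequence` (one vector per frame) rather than the five separate pose streams, which is the sense in which later stages can operate entirely in the latent space.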
