Graph Network for Sign Language Tasks
Shiwei Gan, Yafeng Yin, Zhiwei Jiang, Hongkai Wen, Lei Xie, Sanglu Lu
TL;DR
The paper presents MixSignGraph, a graph-based framework for sign language tasks that treats sign frames as graphs and learns intra-frame, inter-frame, and hierarchical features through three modules: Local Sign Graph, Temporal Sign Graph, and Hierarchical Sign Graph. A patch-based backbone with multiscale graph learning is paired with a Text-driven CTC Pre-training (TCP) approach to enable gloss-free sign language translation (SLT). Across five public datasets and multiple tasks, namely continuous sign language recognition (CSLR) and SLT, the method achieves state-of-the-art or competitive results without relying on extra cues, and the TCP scheme significantly narrows the gap between gloss-free and gloss-based SLT models. The work demonstrates the practicality of graph-based representations for sign language and provides extensive ablations and qualitative analyses to validate the effectiveness of the proposed components. Overall, MixSignGraph offers a robust, scalable path to improved sign language understanding and translation in real-world settings.
Abstract
Recent advances in sign language research have benefited from CNN-based backbones, which are primarily transferred from traditional computer vision tasks (e.g., object identification, image recognition). However, these CNN-based backbones usually excel at extracting features like contours and texture, but may struggle with capturing sign-related features. In fact, sign language tasks require focusing on sign-related regions, including the collaboration between different regions (e.g., the left-hand region and the right-hand region) and the effective content in a single region. To capture such region-related features, we introduce MixSignGraph, which represents sign sequences as a group of mixed graphs and designs three graph modules for feature extraction: the Local Sign Graph (LSG) module, the Temporal Sign Graph (TSG) module, and the Hierarchical Sign Graph (HSG) module. Specifically, the LSG module learns the correlation of intra-frame cross-region features within one frame, i.e., it focuses on spatial features. The TSG module tracks the interaction of inter-frame cross-region features among adjacent frames, i.e., it focuses on temporal features. The HSG module aggregates the same-region features from feature maps of different granularities within a frame, i.e., it focuses on hierarchical features. In addition, to further improve the performance of sign language tasks without gloss annotations, we propose a simple yet counter-intuitive Text-driven CTC Pre-training (TCP) method, which generates pseudo gloss labels from text labels for model pre-training. Extensive experiments conducted on five current public sign language datasets demonstrate the superior performance of the proposed model. Notably, our model surpasses the SOTA models on multiple sign language tasks across several datasets, without relying on any additional cues.
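To make the three graph types more concrete, the following is a minimal sketch of how their edges might be constructed over patch features. It is illustrative only and is not the authors' implementation: the k-nearest-neighbor rule, the Euclidean distance, the 2x downsampling factor between feature maps, and all function names (`knn_edges`, `local_edges`, `temporal_edges`, `hierarchical_edges`) are assumptions introduced here, since the abstract does not specify how edges are chosen.

```python
import math

def knn_edges(src, dst, k, exclude_self=False):
    """Link each src vector to its k nearest dst vectors (Euclidean distance).

    Hypothetical helper: the kNN rule is an assumption, not taken from the paper.
    Returns a list of (src_index, dst_index) pairs.
    """
    edges = []
    for i, u in enumerate(src):
        dists = []
        for j, v in enumerate(dst):
            if exclude_self and i == j:
                continue
            dists.append((math.dist(u, v), j))
        dists.sort()  # nearest first; ties broken by lower index
        edges.extend((i, j) for _, j in dists[:k])
    return edges

def local_edges(frame, k):
    """Intra-frame, cross-region edges (the role the LSG module plays)."""
    return knn_edges(frame, frame, k, exclude_self=True)

def temporal_edges(frame_t, frame_next, k):
    """Inter-frame, cross-region edges between adjacent frames (TSG role)."""
    return knn_edges(frame_t, frame_next, k)

def hierarchical_edges(fine_hw, scale=2):
    """Same-region edges from a fine patch grid to a coarser one (HSG role).

    Assumes the coarse map is the fine map downsampled by `scale` per axis.
    Patches are indexed row-major.
    """
    h, w = fine_hw
    coarse_w = w // scale
    edges = []
    for r in range(h):
        for c in range(w):
            fine_idx = r * w + c
            coarse_idx = (r // scale) * coarse_w + (c // scale)
            edges.append((fine_idx, coarse_idx))
    return edges

# Toy example: four 2-D "patch features" per frame.
frame_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
frame_b = [(5.0, 5.5), (0.5, 0.0)]
print(local_edges(frame_a, k=1))
print(temporal_edges(frame_a, frame_b, k=1))
print(hierarchical_edges((2, 2)))
```

In a full model, each edge set would feed a graph convolution that updates patch features. The sketch stops at edge construction because that is the part the abstract describes.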
