CorrNet+: Sign Language Recognition and Translation via Spatial-Temporal Correlation

Lianyu Hu, Wei Feng, Liqing Gao, Zekang Liu, Liang Wan

TL;DR

CorrNet+ addresses sign language understanding by explicitly modeling cross-frame body trajectories through a lightweight spatial-temporal correlation framework. It combines a correlation module, an identification module, and a temporal attention module to produce trajectory-aware features that enhance CSLR and SLT performance while reducing computational cost. The approach achieves state-of-the-art results on multiple benchmarks, notably outperforming pose-based and heatmap-reliant methods while using RGB inputs only. Ablations and visualizations confirm that focusing on hands and face movements across extended temporal windows yields robust, efficient sign-language representations with practical impact for real-world deployments.

Abstract

In sign language, the conveyance of human body trajectories relies predominantly on the coordinated movements of hands and facial expressions across successive frames. Despite recent advances, sign language understanding methods often focus solely on individual frames, overlooking the inter-frame correlations that are essential for modeling human body trajectories. To address this limitation, this paper introduces a spatial-temporal correlation network, denoted CorrNet+, which explicitly identifies body trajectories across multiple frames. Specifically, CorrNet+ employs a correlation module and an identification module to build human body trajectories. A temporal attention module then adaptively evaluates the contributions of different frames. The resulting features offer a holistic perspective on human body movements, facilitating a deeper understanding of sign language. As a unified model, CorrNet+ achieves new state-of-the-art performance on two sign language understanding tasks: continuous sign language recognition (CSLR) and sign language translation (SLT). In particular, CorrNet+ surpasses previous methods equipped with resource-intensive pose-estimation networks or pre-extracted heatmaps for hand and facial feature extraction. Compared with CorrNet, CorrNet+ achieves a significant performance boost across all benchmarks while halving the computational overhead. A comprehensive comparison with previous spatial-temporal reasoning methods verifies the superiority of CorrNet+. Code is available at https://github.com/hulianyuyy/CorrNet_Plus.

Paper Structure (21 sections, 15 equations, 11 figures, 11 tables)


Figures (11)

  • Figure 1: (a) Illustration of the differences among the isolated sign language recognition (ISLR) task, the continuous sign language recognition (CSLR) task, and the sign language translation (SLT) task. (b) Visualization of correlation maps with Grad-CAM (Selvaraju et al., 2017) between the current frame and two adjacent frames on the left/right side. Without extra supervision, our method attends well to informative regions in adjacent frames to identify human body trajectories.
  • Figure 2: An overview of CorrNet+, which supports both the CSLR task and the SLT task with a common base model. The base model first employs a feature extractor (2D CNN) to capture frame-wise features, and then adopts a 1D CNN and a BiLSTM to perform short-term and long-term temporal modeling, respectively. For the CSLR task, we attach a classifier instantiated as a fully connected layer to perform classification. For the SLT task, we attach a VL-mapper instantiated as an MLP and a translation network to predict sentences. The feature extractor consists of multiple stages that extract spatial features for each frame independently. After each stage of the feature extractor, we insert a correlation stage to capture cross-frame interactions. An identification module and a correlation module are first placed in parallel to identify body trajectories across adjacent frames; their outputs are then multiplied element-wise and fed into the temporal attention module to dynamically emphasize the key human body trajectories in the whole video.
  • Figure 3: Illustration of the difference between the correlation operator in CorrNet (Hu et al., 2023) and in CorrNet+. (a) CorrNet (Hu et al., 2023) computes correlation maps between each spatial patch $p_t(i,j)$ in $x_t$ and all patches in the adjacent frames $x_{t+1}$ and $x_{t-1}$. The overall computational complexity is $O(H^2W^2)$, quadratic in the number of spatial patches per frame, which incurs heavy extra computation. (b) To reduce computation, we condense the features of $x_t$ into several compact representations, which are then used to compute correlation maps with adjacent frames on behalf of $x_t$. Because the number of selected patches for $x_t$ is reduced from $H\times W$ to $O(1)$, the computational complexity drops from $O(H^2W^2)$ to $O(HW)$. This also enables us to compute correlation maps with neighbors over a longer temporal span, capturing the whole body movement involved in expressing a sign more effectively.
  • Figure 4: A framework overview of our proposed correlation module. It first condenses each frame into a compact representation, and then uses it to compute correlation maps with adjacent frames within a predefined range $L$ to model human body trajectories.
  • Figure 5: Illustration of our identification module. To avoid heavy computation when identifying informative spatial regions during local spatial-temporal modeling, we decompose the spatial-temporal modeling structure along the spatial and temporal dimensions simultaneously to form a multiscale architecture, enlarging the model capacity.
  • ...and 6 more figures
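
The complexity argument in the Figure 3 caption can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, each frame is represented as a flat list of $H \times W$ patch vectors, and the "condensation" step is assumed to be a plain average over patches (the caption does not say how the compact representations are produced). The sketch only compares how many patch-to-patch affinities each scheme computes.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def full_correlation(x_t, x_adj):
    """All-pairs affinities: every patch of x_t against every patch of x_adj.

    Produces (H*W) x (H*W) values, i.e. O(H^2 W^2) dot products.
    """
    return [[dot(p, q) for q in x_adj] for p in x_t]


def condense(x_t):
    """ASSUMPTION: condense a frame into one vector by averaging its patches."""
    n = len(x_t)
    channels = len(x_t[0])
    return [sum(p[c] for p in x_t) / n for c in range(channels)]


def condensed_correlation(x_t, x_adj):
    """Affinity of one compact descriptor of x_t with every patch of x_adj.

    Produces H*W values, i.e. O(HW) dot products.
    """
    z = condense(x_t)
    return [dot(z, q) for q in x_adj]


# Toy frames: H*W patches, each a C-dimensional feature vector.
H, W, C = 4, 4, 8
rng = random.Random(0)
x_t = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(H * W)]
x_next = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(H * W)]

full = full_correlation(x_t, x_next)
cond = condensed_correlation(x_t, x_next)

full_count = len(full) * len(full[0])  # (H*W)^2 affinities
cond_count = len(cond)                 # H*W affinities
print(full_count, cond_count)
```

With average pooling as the assumed condensation, each condensed affinity equals the mean of the corresponding column of the full correlation map, because the dot product is linear. A learned condensation would not generally have this property, but the count of affinities per adjacent frame would still scale with $HW$ rather than $H^2W^2$.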