TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions

Hui Lu, Albert Ali Salah, Ronald Poppe

TL;DR

TCNet addresses continuous sign language recognition by introducing two novel modules: a trajectory module that aligns temporal movements and enables self-attention along motion trajectories, and a correlation module that performs dynamic, region-focused sparse attention to filter irrelevant regions. These modules are integrated into a hybrid CNN–attention architecture that can use various backbones and reduces computation while improving recognition accuracy. Across four large CSLR datasets, TCNet delivers state-of-the-art WER improvements and demonstrates robustness through ablations and backbone experiments. The approach provides practical benefits for scalable CSLR in real-world video settings and is accompanied by code for reproducibility.

Abstract

A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens. In addition, for a query token, self-attention is learned along the trajectory. As such, our network can also focus on fine-grained spatio-temporal patterns, such as finger movements, of a specific region in motion. TCNet's correlation module uses a novel dynamic attention mechanism that filters out irrelevant frame regions. Additionally, it assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce computation cost and memory use. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state-of-the-art by 1.5% and 1.0% word error rate on PHOENIX14 and PHOENIX14-T, respectively.
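
The results above are reported as word error rate (WER). For readers unfamiliar with the metric, WER is commonly defined as the word-level edit distance (substitutions, deletions, and insertions) between the predicted and reference gloss sequences, divided by the reference length. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the example glosses are illustrative assumptions, not taken from the TCNet code or datasets.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical gloss sequences: one substitution and one deletion
# against a four-word reference.
print(word_error_rate("MORGEN REGEN NORD WIND", "MORGEN SONNE NORD"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.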

Paper Structure (16 sections, 6 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Trajectories and correlated regions. The sign is revealed by the trajectories of both hands in Seq. 1, and the right hand and head in Seq. 2. The content of these regions, such as the hand pose, is also important. We visualize the trajectory of the right hand (red) and indicate correlated regions (yellow).
  • Figure 2: TCNet architecture with feature extractor, sequential modeling and classifier. TCNet blocks with our trajectory and correlation modules extract spatio-temporal features at various sequential stages.
  • Figure 3: Calculation of location map $L$ and encoding map $E$ in the trajectory module. For each frame, regions are traced back. Coordinates of a region in previous frames are stored at the same location. The location map is passed through an encoder to produce encoding map $E$.
  • Figure 4: Calculation of the sparse attention matrix $I^{r}$ in the correlation module. In this example, our input is a $4 \times 4$ image and $K = 2$. The frame is split into four regions R1-R4, which pass through a $1 \times 1$ convolution to yield queries and keys. For R1 (other regions analogous), the left branch provides affinity matrix $A^{R1}$, while the right branch generates binary gate map $g$. Both volumes are multiplied element-wise and pruned to produce sparse attention matrix $I^{R1}$. By combining these results for all regions, we get sparse and dynamic affinity matrix $I^R$ for the current frame.
  • Figure 5: Grad-CAM heatmaps for CorrNet and TCNet with only the Trajectory module, only the Correlation module, and with both modules.
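
The gate-then-prune idea in the Figure 4 caption can be illustrated with a rough NumPy sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: each region is mean-pooled into a single descriptor, the $1 \times 1$ convolutions are written as per-region linear maps (`w_q`, `w_k`), the binary gate is a thresholded linear score per key region (`w_g`), and pruning keeps the `top_k` largest surviving affinities per query region. All names are hypothetical.

```python
import numpy as np


def sparse_region_attention(frame, w_q, w_k, w_g, num_side=2, top_k=2):
    """Return an (R, R) sparse region affinity matrix and the (R,) binary gate.

    frame: (H, W, C) feature map; R = num_side * num_side regions.
    """
    H, W, C = frame.shape
    rh, rw = H // num_side, W // num_side
    # Split the frame into non-overlapping regions and mean-pool each one
    # into a single C-dimensional descriptor (assumption).
    regions = frame.reshape(num_side, rh, num_side, rw, C).mean(axis=(1, 3))
    regions = regions.reshape(num_side * num_side, C)

    # A 1x1 convolution acts as a linear map over channels.
    q = regions @ w_q
    k = regions @ w_k
    affinity = q @ k.T  # dense region-to-region affinity, shape (R, R)

    # Binary gate per key region (assumed form): keep a region if its
    # learned relevance score is positive.
    gate = (regions @ w_g) > 0

    # Gated-out regions are excluded, then only the top_k strongest
    # remaining affinities are kept for every query region.
    scores = np.where(gate[None, :], affinity, -np.inf)
    topk_idx = np.argsort(-scores, axis=1)[:, :top_k]
    rows = np.arange(affinity.shape[0])[:, None]
    keep = np.isfinite(scores[rows, topk_idx])

    sparse = np.zeros_like(affinity)
    sparse[rows, topk_idx] = np.where(keep, affinity[rows, topk_idx], 0.0)
    return sparse, gate
```

With a $4 \times 4$ input, `num_side=2`, and `top_k=2`, this yields a $4 \times 4$ matrix in which each row has at most two non-zero entries and every column belonging to a gated-out region is zero. In a trained model the weights would be learned; random weights are enough to check the shapes and sparsity pattern.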