OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition

Yiheng Yu, Sheng Liu, Yuan Feng, Min Xu, Zhelun Jin, Xuhua Yang

TL;DR

OLMD tackles continuous sign language recognition (CSLR) by modeling long-term, multi-orientational motions through Long-term Motion Aggregation (LMA) and orientation-aware decoupling. The framework decouples motion into horizontal and vertical components, purifies orientation-specific cues, and uses stage and cross-stage coupling to fuse multi-scale features; the model is trained with CTC and self-distillation losses. It reports state-of-the-art results on PHOENIX14, PHOENIX14-T, and CSL-Daily, including an absolute WER reduction of 1.6% on PHOENIX14 over the previous SOTA, which indicates improved handling of complex, long-range gestures. The approach offers practical, real-time CSLR benefits by enhancing motion capture, orientation discrimination, and multi-scale feature integration, establishing a strong baseline for future work.

Abstract

The primary challenge in continuous sign language recognition (CSLR) stems from the presence of multi-orientational and long-term motions. However, current research overlooks these crucial aspects, which significantly impacts accuracy. To tackle these issues, we propose a novel CSLR framework: Orientation-aware Long-term Motion Decoupling (OLMD), which efficiently aggregates long-term motions and decouples multi-orientational signals into easily interpretable components. Specifically, our innovative Long-term Motion Aggregation (LMA) module filters out static redundancy while adaptively capturing abundant features of long-term motions. We further enhance orientation awareness by decoupling complex movements into horizontal and vertical components, allowing for motion purification in both orientations. Additionally, two coupling mechanisms are proposed: stage and cross-stage coupling, which together enrich multi-scale features and improve the generalization capabilities of the model. Experimentally, OLMD shows SOTA performance on three large-scale datasets: PHOENIX14, PHOENIX14-T, and CSL-Daily. Notably, we improved the word error rate (WER) on PHOENIX14 by an absolute 1.6% compared to the previous SOTA.
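
The abstract reports its headline result as an absolute WER reduction, meaning a difference in percentage points rather than a relative change. As a reminder of how the metric is usually computed, here is a minimal sketch of WER as the word-level edit distance (substitutions, deletions, insertions) divided by the reference length. The function name `word_error_rate` and the gloss sequences are made up for illustration; this is not the evaluation script used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up gloss sequences, for illustration only.
reference = "MORGEN REGEN NORD WIND"
hypothesis = "MORGEN SONNE NORD"
print(word_error_rate(reference, hypothesis))  # 0.5 (one substitution + one deletion over 4 glosses)
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.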

Paper Structure

This paper contains 17 sections, 15 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: (a) We illustrate several common glosses with relevant video clips from CSL-Daily, highlighting the extensive presence of long-term and multi-orientational motions in sign language (yellow and red arrows indicate motion orientations), which traditional CSLR models struggle to manage. (b) OLMD performance comparison on PHOENIX14: OLMD surpasses SOTA models by a large margin.
  • Figure 2: An overview of the proposed OLMD. After each stage of the Feature Extractor, frame-wise features first aggregate long-term motion information via LMA, and the result is then decoupled into horizontal and vertical components. HMP (Horizontal Motion Purification) and VMP (Vertical Motion Purification) are subsequently applied to enhance orientation-specific motion awareness. Stage and cross-stage coupling leverage enhanced features within and across stages, ensuring the integrity of decoupling-coupling while enriching the utilization of multi-scale features. Finally, two 1D-CNNs with a shared architecture handle downsampling and local temporal modeling, while a BiLSTM handles global temporal modeling.
  • Figure 3: Details of the decoupling and stage-coupling are shown. (a) and (b) represent different designs of the OMP.
  • Figure 4: The main idea of our Long-term Motion Aggregation (LMA) module, illustrated with a context length of 5.
  • Figure 5: Heatmap visualizations of the LMA module using Grad-CAM. LMA effectively suppresses static information (blue regions) and focuses on motion areas (red and yellow regions, indicating moving hands and facial changes).
  • ...and 1 more figure
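
The captions above describe two ideas: suppressing static content over a temporal context (Figure 4 uses a context length of 5) and splitting motion into horizontal and vertical components (Figure 2). The toy sketch below illustrates those ideas only. It is not the authors' LMA, HMP, or VMP; the plain frame differencing, the axis-wise averaging, and the names `motion_saliency` and `decouple` are all assumptions chosen for simplicity.

```python
def motion_saliency(frames, context=5):
    """frames: list of T frames, each an H x W list of floats.
    For every frame, return the mean absolute difference to the other frames
    in a centered window of `context` frames (truncated at the clip ends).
    Static pixels therefore map to 0. Assumed stand-in, not the paper's LMA."""
    T, H, W = len(frames), len(frames[0]), len(frames[0][0])
    half = context // 2
    out = []
    for t in range(T):
        nbrs = [s for s in range(max(0, t - half), min(T, t + half + 1)) if s != t]
        frame = [[0.0] * W for _ in range(H)]
        for y in range(H):
            for x in range(W):
                if nbrs:
                    diffs = [abs(frames[t][y][x] - frames[s][y][x]) for s in nbrs]
                    frame[y][x] = sum(diffs) / len(nbrs)
        out.append(frame)
    return out


def decouple(motion):
    """Split each H x W motion map into a width-indexed profile (mean over rows)
    and a height-indexed profile (mean over columns). Assumed illustration."""
    horizontal = [[sum(col) / len(col) for col in zip(*m)] for m in motion]
    vertical = [[sum(row) / len(row) for row in m] for m in motion]
    return horizontal, vertical


# Toy clip: a single bright dot sliding left to right along row 1.
T, H, W = 5, 3, 5
frames = [[[1.0 if (y == 1 and x == t) else 0.0 for x in range(W)]
           for y in range(H)] for t in range(T)]
motion = motion_saliency(frames, context=5)
horizontal, vertical = decouple(motion)
```

In this toy clip the height-indexed profile is non-zero only at row 1, where the dot travels, while the width-indexed profile spreads across the columns the dot visits within the window. A learned module would replace both the differencing and the averaging with trainable operations.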