Cross-Modal Consistency Learning for Sign Language Recognition

Kepeng Wu, Zecheng Li, Hezhen Hu, Wengang Zhou, Houqiang Li

TL;DR

CCL-SLR addresses ISLR pre-training by enforcing cross-modal consistency between RGB and pose representations through dual-branch contrastive learning. It introduces Motion-Preserving Masking (MPM) to suppress non-semantic background in RGB videos and Semantic Positive Mining (SPM) to supply cross-modal pseudo-labels, enabling robust cross-modal alignment during pre-training. The method delivers strong improvements across four ISLR benchmarks, surpassing several prior approaches on MSASL, WLASL, NMFs-CSL, and SLR500, and demonstrates the practicality of large-scale, self-supervised cross-modal pre-training for sign language understanding. The work provides comprehensive ablations and analyses, showing that cross-modal losses, motion-aware augmentations, and semantic neighbor mining collectively enhance sign representation learning with potential impact on real-world sign-language systems.

Abstract

Pre-training has proven effective in boosting the performance of Isolated Sign Language Recognition (ISLR). Existing pre-training methods focus solely on compact pose data, which eliminates background perturbation but inevitably provides fewer semantic cues than raw RGB videos. Nevertheless, learning representations directly from RGB videos remains challenging due to the presence of sign-independent visual features. To address this dilemma, we propose a Cross-modal Consistency Learning framework (CCL-SLR), which leverages the consistency between the RGB and pose modalities in self-supervised pre-training. First, CCL-SLR employs contrastive learning for instance discrimination within and across modalities. Through single-modal and cross-modal contrastive learning, CCL-SLR gradually aligns the feature spaces of the RGB and pose modalities, thereby extracting consistent sign representations. Second, we introduce Motion-Preserving Masking (MPM) and Semantic Positive Mining (SPM) to improve cross-modal consistency from the perspectives of data augmentation and sample similarity, respectively. Extensive experiments on four ISLR benchmarks show that CCL-SLR achieves impressive performance, demonstrating its effectiveness. The code will be released to the public.
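
The cross-modal instance discrimination described in the abstract can be illustrated with a minimal sketch of a symmetric InfoNCE-style loss. This is not the authors' implementation: the function names, the temperature value, and the use of in-batch negatives are assumptions made for illustration, and plain Python lists stand in for the tensors a real training loop would use.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss where queries[i] and keys[i] form the positive pair.

    Every other key in the batch acts as a negative (an assumption of this
    sketch; the temperature default is also hypothetical).
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def cross_modal_loss(rgb_embeddings, pose_embeddings, temperature=0.1):
    """Symmetric cross-modal term: RGB-to-pose plus pose-to-RGB, averaged."""
    return 0.5 * (
        info_nce(rgb_embeddings, pose_embeddings, temperature)
        + info_nce(pose_embeddings, rgb_embeddings, temperature)
    )
```

Calling `cross_modal_loss(rgb, pose)` on a batch in which row `i` of both lists comes from the same video yields a smaller value when the two modalities agree than when their rows are mismatched, which is the alignment pressure the pre-training relies on.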

Paper Structure

This paper contains 19 sections, 12 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison between previous pre-training paradigms and ours. (a) Previous paradigms [best, stc-slr, signbert, signbert+] do not model the interaction between the RGB and pose modalities during pre-training. (b) Our proposed method effectively aligns the embedding spaces of the two modalities via cross-modal consistency learning.
  • Figure 2: (a) The CCL-SLR pre-training framework. RGB and pose sequences are processed through two single-modal branches to learn modality-specific representations. During this process, MPM is applied to mask motion-irrelevant regions in the RGB videos. Additionally, cross-modal positive samples are identified through SPM and used in the single-modal losses $\mathcal{L}^R$ and $\mathcal{L}^L$. Finally, the consistency of the RGB and pose embedding spaces is constrained through the cross-modal contrastive loss $\mathcal{L}_{cross}$. (b) An overview of Semantic Positive Mining. SPM selects the top-$k$ most similar samples from each modality-specific memory bank and merges them to obtain pseudo-labels.
  • Figure 3: An overview of MPM. $h(\cdot)$ and $h^{-1}(\cdot)$ denote the transformations from pixel space to latent space and from latent space back to pixel space, respectively. The video is encoded into latent space using a flow-based generative model. Channels in the latent representation with low standard deviation along the temporal dimension are then masked out. The masked latent representation is decoded back into pixel space. The reconstructed frames are then gray-scaled and binarized to produce a motion-aware mask sequence.
  • Figure 4: Heatmap visualizations produced by Grad-CAM [8237336]. Top: raw frames; Middle: heatmaps from the fine-tuned RGB encoder pre-trained without $\mathcal{L}_{cross}$; Bottom: heatmaps from the fine-tuned RGB encoder pre-trained with $\mathcal{L}_{cross}$. All samples are from the NMFs-CSL [nmfs] dataset.
  • Figure 5: Effect of hyperparameter selection on CCL-SLR performance. We systematically investigate the impact of the pre-training data scale, the mask probability $\alpha$ in MPM, and the number of pseudo-positive labels $k$ in SPM.
  • ...and 4 more figures
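
The two selection steps named in the captions of Figures 2(b) and 3 can be sketched as follows. This is a hedged illustration rather than the paper's code: all function names are hypothetical, merging by set union is an assumption, `keep_ratio` is an invented parameter (it is not the mask probability $\alpha$), and each latent channel is reduced to one scalar per frame. The flow-based encoder $h(\cdot)$, the decoder $h^{-1}(\cdot)$, and the gray-scaling and binarization steps are not covered.

```python
import math
from statistics import pstdev


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_indices(query, bank, k):
    """Indices of the k memory-bank entries most similar to the query."""
    ranked = sorted(range(len(bank)), key=lambda i: cosine(query, bank[i]), reverse=True)
    return ranked[:k]


def mine_semantic_positives(rgb_query, pose_query, rgb_bank, pose_bank, k):
    """Merge per-modality top-k neighbours into one pseudo-positive index set.

    Index i is assumed to refer to the same sample in both banks; merging by
    set union is an assumption of this sketch.
    """
    rgb_neighbours = set(top_k_indices(rgb_query, rgb_bank, k))
    pose_neighbours = set(top_k_indices(pose_query, pose_bank, k))
    return rgb_neighbours | pose_neighbours


def mask_static_channels(latent, keep_ratio=0.5):
    """Zero the latent channels whose values vary least over time.

    latent is a list of T frames, each a list of C channel values
    (a simplification: real latents also have spatial dimensions).
    """
    num_channels = len(latent[0])
    # Population standard deviation of each channel along the temporal axis.
    stds = [pstdev([frame[c] for frame in latent]) for c in range(num_channels)]
    num_keep = max(1, round(keep_ratio * num_channels))
    ranked = sorted(range(num_channels), key=lambda c: stds[c], reverse=True)
    keep = set(ranked[:num_keep])
    return [[v if c in keep else 0.0 for c, v in enumerate(frame)] for frame in latent]
```

In this sketch, the index set returned by `mine_semantic_positives` would serve as extra positives in the single-modal losses, while `mask_static_channels` stands in for the step that discards temporally static (and therefore likely motion-irrelevant) latent channels before decoding.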