SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval

Longtao Jiang, Min Wang, Zecheng Li, Yao Fang, Wengang Zhou, Houqiang Li

TL;DR

Sign language retrieval requires semantic understanding of actions within video clips. The authors introduce SEDS, a Semantically Enhanced Dual-Stream Encoder that fuses an online Pose encoder with an offline RGB encoder through Cross Gloss Attention Fusion and a Pose-RGB fine-grained matching objective, enabling end-to-end training with lightweight components. The method explicitly models both local gesture details and global visual semantics, using three cross-modal alignment losses and a joint objective to robustly align Pose, RGB, and text representations. Across How2Sign, PHOENIX-2014T, and CSL-Daily, SEDS achieves state-of-the-art results, demonstrating strong cross-modal retrieval performance and robustness to signer and background variations.

Abstract

Unlike traditional video retrieval, sign language retrieval depends more heavily on understanding the semantics of the human actions in video clips. Previous works typically encode only RGB videos to obtain high-level semantic features, so local action details are drowned out by a large amount of redundant visual information. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of embedding dense visual data during end-to-end training and adopt an offline RGB encoder instead, which leads to suboptimal feature representations. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates the Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of the two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module that aggregates adjacent clip features with similar semantic information, both within and across modalities. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature through contextual matching of fine-grained dual-stream features. Apart from the offline RGB encoder, the whole framework contains only lightweight learnable networks, which can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on various datasets.
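
The text-video alignment mentioned above can be illustrated with a small sketch. The abstract does not state the exact loss, so the code below assumes a symmetric contrastive (InfoNCE-style) objective, a common choice for cross-modal retrieval. The function names, the temperature of 0.07, and the unweighted sum over the Pose, RGB, and fused streams are all illustrative assumptions, not the authors' implementation.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss (assumed form, not taken from the paper).

    The i-th video and the i-th text are treated as the positive pair;
    every other pairing in the batch acts as a negative.
    """
    n = len(video_feats)
    sim = [[cosine(v, t) / temperature for t in text_feats] for v in video_feats]
    # Video-to-text direction: each row is a distribution over texts.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: transpose and repeat.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)

def joint_text_video_loss(pose, rgb, fused, text):
    """Hypothetical unweighted sum of the three text-video alignment terms."""
    return (alignment_loss(pose, text)
            + alignment_loss(rgb, text)
            + alignment_loss(fused, text))

# Toy batch of two samples with 2-D features.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss([[1.0, 0.0], [0.0, 1.0]], text)
shuffled = alignment_loss([[0.0, 1.0], [1.0, 0.0]], text)
```

On this toy batch the loss is close to zero when each video feature matches its own text and large when the pairs are swapped, which is the behaviour a retrieval objective of this kind is meant to reward.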

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The difference in properties of video information between (a) traditional text-to-video retrieval and (b) sign language text-to-video retrieval. The former describes the content in video frames, while the latter corresponds to the semantics of the actions in video clips.
  • Figure 2: Overview of our Semantically Enhanced Dual-Stream Encoder framework, which consists of three parts: 1) the Pose and RGB Feature Extraction Module, 2) the Cross Gloss Attention Fusion (CGAF) Module, and 3) the Pose-RGB Fine-grained Matching Objective. The network is jointly optimized by 1) three text-video alignment losses, between the Text features and each of the Pose, RGB, and Fusion features, and 2) a Pose-RGB fine-grained alignment loss between the Pose modality and the RGB modality.
  • Figure 3: The structure of two-stream Cross Gloss Attention Fusion (CGAF) module.
  • Figure 4: The illustration of Pose-RGB fine-grained matching objective. The red box represents the softmax operation on the corresponding element.
  • Figure 5: Ablation study on How2Sign investigating the influence of $\alpha$ and $\beta$, with $\alpha = 0.8$ in (a) and $\beta = 0.4$ in (b).
  • ...and 2 more figures
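
The Figure 4 caption mentions a softmax applied to elements of a Pose-RGB matching, but this excerpt does not give the formula. As a rough illustration only, the sketch below assumes a clip-level similarity matrix between the two streams in which each row (and each column) is re-weighted by a softmax before averaging. The function names, the dot-product similarity, and the temperature of 0.1 are hypothetical choices, not details from the paper.

```python
import math

def softmax(row, temperature):
    """Numerically stable softmax of row / temperature."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def directional_score(sim, temperature):
    """For each row, take a softmax-weighted sum of its similarities,
    then average the per-row scores."""
    row_scores = []
    for row in sim:
        weights = softmax(row, temperature)
        row_scores.append(sum(w * s for w, s in zip(weights, row)))
    return sum(row_scores) / len(row_scores)

def fine_grained_score(pose_clips, rgb_clips, temperature=0.1):
    """Assumed clip-level Pose-RGB matching score (not the paper's formula).

    pose_clips and rgb_clips are lists of per-clip feature vectors.
    """
    sim = [[dot(p, r) for r in rgb_clips] for p in pose_clips]
    sim_t = [list(col) for col in zip(*sim)]  # RGB-to-Pose view
    return 0.5 * (directional_score(sim, temperature)
                  + directional_score(sim_t, temperature))

# Two clips per stream, 2-D features.
pose = [[1.0, 0.0], [0.0, 1.0]]
matched = fine_grained_score(pose, [[1.0, 0.0], [0.0, 1.0]])
flat = fine_grained_score(pose, [[0.5, 0.5], [0.5, 0.5]])
```

With these toy inputs, the score is higher when each Pose clip has a clearly best-matching RGB clip than when every RGB clip looks equally similar, because the softmax concentrates weight on the strongest correspondences.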