Uncertainty-aware sign language video retrieval with probability distribution modeling

Xuan Wu, Hongxiang Li, Yuanjiang Luo, Xuxin Cheng, Xianwei Zhuang, Meng Cao, Keren Fu

TL;DR

This work tackles uncertainty in sign language video retrieval by reframing sign video and text as probability distributions and aligning them through Optimal Transport. By modeling each modality as a multivariate Gaussian with learnable mean $\mu$ and variance $\sigma^2$, and applying Monte Carlo sampling, UPRet enables flexible one-to-many mappings that better capture gesture polysemy. OT provides a principled, minimum-cost alignment between the resulting distributions, yielding a distribution-level cross-modal loss that improves fine-grained matching. Across How2Sign, PHOENIX-2014T, and CSL-Daily, UPRet achieves state-of-the-art results, driven by distribution modeling, probabilistic representations, and entropy-regularized OT, with practical implications for more robust, accessible sign-language retrieval systems.
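The two ingredients named above, Monte Carlo sampling from a Gaussian and entropy-regularized OT, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`sample_gaussian`, `sinkhorn_plan`, `ot_distance`), the diagonal covariance, the log-variance parameterization, the cosine cost, the uniform marginals, and the values of `reg` and `n_iters` are all assumptions chosen for illustration.

```python
import numpy as np


def sample_gaussian(mu, log_var, k, rng):
    """Draw k samples from N(mu, diag(exp(log_var))).

    Uses the reparameterization x = mu + sigma * eps with eps ~ N(0, I).
    A diagonal covariance is an assumption of this sketch.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal((k, mu.shape[0]))
    return mu + sigma * eps


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn iterations.

    Both marginals are assumed uniform; reg is the entropy weight.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # source marginal
    b = np.full(m, 1.0 / m)      # target marginal
    K = np.exp(-cost / reg)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows to match a
        v = b / (K.T @ u)        # rescale columns to match b
    return u[:, None] * K * v[None, :]


def ot_distance(x, y, reg=0.1):
    """Transport cost between two sample sets under a cosine cost (assumed)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - xn @ yn.T       # pairwise cosine distance
    plan = sinkhorn_plan(cost, reg)
    return float((plan * cost).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, k = 16, 8
    # Hypothetical encoder outputs: one mean and log-variance per modality.
    video = sample_gaussian(rng.standard_normal(dim), np.full(dim, -1.0), k, rng)
    text = sample_gaussian(rng.standard_normal(dim), np.full(dim, -1.0), k, rng)
    print("OT distance between sample sets:", ot_distance(video, text))
```

In this sketch each modality contributes `k` sampled points rather than a single embedding, so the transport plan can spread mass from one video sample across several text samples. That is one way to read the one-to-many mapping described above.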

Abstract

Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude the direct application of these techniques. Previous methods map sign language video to text through fine-grained modal alignment. However, because fine-grained annotation is scarce, the uncertainty inherent in sign language video is underestimated, which limits further progress on sign language retrieval. To address this challenge, we propose a novel Uncertainty-aware Probability Distribution Retrieval (UPRet) that conceptualizes the mapping between sign language video and text in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%).

Paper Structure (17 sections, 15 equations, 4 figures, 7 tables)

This paper contains 17 sections, 15 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a) Illustration of the uncertainty between sign language video and text. (b) Previous methods obtain single-point representations through one-to-one mappings, which make it difficult to capture one-to-many relationships in semantic space and thus present challenges in dealing with uncertainty in sign language scenarios. (c) Our method re-models representations in terms of probability distributions to better deal with uncertainty.
  • Figure 2: Overview of UPRet. Sign language video features and text features are extracted by the sign encoder and text encoder, and subsequently, we model the video distribution and text distribution. Finally, after randomly sampling the distribution, the distance of the distribution is measured using optimal transport to encourage fine-grained alignment.
  • Figure 3: Effect of the hyper-parameters on the How2Sign dataset: the number of samples in distribution modeling and the weights in optimal transport.
  • Figure 4: Visualization of the text-to-sign-video retrieval output on How2Sign. Red: incorrect results of the baseline model. Green: correct results of our method.