Uncertainty-aware sign language video retrieval with probability distribution modeling
Xuan Wu, Hongxiang Li, Yuanjiang Luo, Xuxin Cheng, Xianwei Zhuang, Meng Cao, Keren Fu
TL;DR
This work tackles uncertainty in sign language video retrieval by representing sign videos and text as probability distributions and aligning them through Optimal Transport (OT). The proposed method, UPRet, models each modality as a multivariate Gaussian with learnable mean $\mu$ and variance $\sigma^2$ and draws Monte Carlo samples from it, which enables flexible one-to-many mappings that better capture gesture polysemy. OT provides a principled, minimum-cost alignment between the resulting distributions, yielding a distribution-level cross-modal loss that improves fine-grained matching. Across How2Sign, PHOENIX-2014T, and CSL-Daily, UPRet achieves state-of-the-art results, driven by its probabilistic representations and entropy-regularized OT alignment, with practical implications for more robust and accessible sign language retrieval systems.
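To make the pipeline summarized above concrete, the following is a minimal NumPy sketch of its three ingredients: Gaussian sampling, a pairwise cost, and entropy-regularized OT solved with Sinkhorn iterations. It is an illustration under stated assumptions, not the authors' implementation. The function names (`sample_gaussian`, `sinkhorn`, `ot_distance`), the diagonal covariance, the cosine cost, the uniform marginals, and the values of `reg`, `n_iters`, `dim`, and `n_samples` are all hypothetical choices made for this example.

```python
import numpy as np


def sample_gaussian(mu, log_var, n_samples, rng):
    """Draw samples from N(mu, diag(exp(log_var))) via the reparameterization trick.

    Assumption: a diagonal covariance parameterized by its log-variance.
    """
    std = np.exp(0.5 * log_var)                       # sigma from log-variance
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + eps * std                             # shape: (n_samples, dim)


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)                           # source marginal
    b = np.full(m, 1.0 / m)                           # target marginal
    K = np.exp(-cost / reg)                           # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)                               # rescale rows
        v = b / (K.T @ u)                             # rescale columns
    return u[:, None] * K * v[None, :]                # transport plan


def ot_distance(x, y, reg=0.1):
    """Transport cost between two sample sets under a cosine cost (assumed)."""
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_n = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - x_n @ y_n.T                          # values in [0, 2]
    plan = sinkhorn(cost, reg)
    return float((plan * cost).sum()), plan


# Toy usage: one "video" distribution, a matching and a non-matching "text".
rng = np.random.default_rng(0)
dim, n_samples = 8, 16
log_var = np.full(dim, -4.0)                          # small variance
video_mu = np.ones(dim)

video = sample_gaussian(video_mu, log_var, n_samples, rng)
text_match = sample_gaussian(video_mu, log_var, n_samples, rng)
text_other = sample_gaussian(-video_mu, log_var, n_samples, rng)

d_match, _ = ot_distance(video, text_match)
d_other, _ = ot_distance(video, text_other)
print(d_match < d_other)                              # matched pair is cheaper to transport
```

In a trainable model, `mu` and `log_var` would be produced by the video and text encoders, and the transport cost would feed a retrieval loss. This sketch only shows the forward computation on fixed toy parameters.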
Abstract
Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude the direct application of these techniques. Previous methods map between sign language video and text through fine-grained modal alignment. However, because fine-grained annotation is scarce, these methods underestimate the uncertainty inherent in sign language video, which limits further progress on sign language retrieval. To address this challenge, we propose a novel Uncertainty-aware Probability Distribution Retrieval (UPRet) method that conceptualizes the mapping between sign language video and text in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%).
