Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
Hao Li, Jingkuan Song, Lianli Gao, Xiaosu Zhu, Heng Tao Shen
TL;DR
Cross-modal retrieval often suffers from aleatoric uncertainty caused by low-quality data, which leads to unreliable similarity predictions. The paper introduces PAU, a prototype-based evidential framework that represents each modality with $K$ learnable prototypes and maps prototype–instance similarities to Dirichlet parameters via Dempster-Shafer Theory and Subjective Logic, yielding an explicit uncertainty score $u$ and belief masses $b_k$. It couples uncertainty modeling with dedicated uncertainty and diversity losses and a re-ranking step to produce more trustworthy predictions. Experiments on MSR-VTT, MSVD, DiDeMo, and MS-COCO show that PAU consistently improves retrieval performance, especially under noisy conditions. The approach provides a practical, scalable mechanism for quantifying and exploiting data ambiguity in cross-modal learning, with potential to improve data selection and pretraining efficiency for large multi-modal systems.
Abstract
Cross-modal retrieval methods build similarity relations between the vision and language modalities by jointly learning a common representation space. However, their predictions are often unreliable because of aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework that provides trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity. Concretely, we first construct a set of diverse learnable prototypes for each modality to represent the entire semantic subspace. Then, Dempster-Shafer Theory and Subjective Logic Theory are used to build an evidential theoretical framework by associating evidence with Dirichlet distribution parameters. The PAU model induces accurate uncertainty and reliable predictions for cross-modal retrieval. Extensive experiments on four major benchmark datasets, MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrate the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
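To make the link between evidence and Dirichlet parameters concrete, the following is a minimal sketch of the standard Subjective Logic mapping, not the authors' implementation. It assumes cosine similarity between an instance and each of the $K$ prototypes, and an illustrative evidence function $e_k = \exp(s_k / \tau)$; the function names, the temperature `tau`, and the choice of evidence function are assumptions made for illustration only. Given non-negative evidence, the Dirichlet parameters are $\alpha_k = e_k + 1$, the Dirichlet strength is $S = \sum_k \alpha_k$, the belief masses are $b_k = e_k / S$, and the uncertainty is $u = K / S$, so that $\sum_k b_k + u = 1$.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def subjective_opinion(instance, prototypes, tau=0.1):
    """Map prototype-instance similarities to belief masses and uncertainty.

    The evidence function exp(sim / tau) is an assumption for illustration;
    any mapping that yields non-negative evidence fits the same framework.
    """
    sims = [cosine(instance, p) for p in prototypes]
    evidence = [math.exp(s / tau) for s in sims]   # e_k >= 0 (assumed form)
    alpha = [e + 1.0 for e in evidence]            # Dirichlet parameters
    strength = sum(alpha)                          # Dirichlet strength S
    k = len(prototypes)
    belief = [e / strength for e in evidence]      # b_k = e_k / S
    uncertainty = k / strength                     # u = K / S
    return belief, uncertainty


# Hypothetical 2-D prototypes and two instances, for illustration only.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
clear_belief, clear_u = subjective_opinion([1.0, 0.0], prototypes)
vague_belief, vague_u = subjective_opinion([-1.0, -1.0], prototypes)
```

Under these assumptions, an instance that aligns closely with one prototype accumulates large evidence and receives a low uncertainty, while an instance that is dissimilar to every prototype accumulates little evidence and receives an uncertainty close to 1.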
