Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval

Hao Li; Jingkuan Song; Lianli Gao; Xiaosu Zhu; Heng Tao Shen

TL;DR

Cross-modal retrieval often suffers from aleatoric uncertainty due to data quality, leading to unreliable similarity predictions. The paper introduces PAU, a prototype-based evidential framework that represents each modality with $K$ learnable prototypes and maps prototype–instance similarities to Dirichlet parameters via Dempster-Shafer Theory and Subjective Logic, enabling an explicit uncertainty score $u$ and belief masses $b_k$. It couples uncertainty modeling with dedicated losses (uncertainty and diversity) and a re-ranking step to produce more trustworthy predictions, validated on MSR-VTT, MSVD, DiDeMo, and MS-COCO where PAU consistently improves retrieval performance, especially under noisy conditions. The approach provides a practical, scalable mechanism to quantify and leverage data ambiguity in cross-modal learning, with potential to improve data selection and pretraining efficiency for large multi-modal systems.

Abstract

Cross-modal retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity. Concretely, we first construct a set of diverse learnable prototypes for each modality to represent the entire semantic subspace. Then Dempster-Shafer Theory and Subjective Logic Theory are used to build an evidential theoretical framework by associating evidence with Dirichlet distribution parameters. The PAU model yields accurate uncertainty estimates and reliable predictions for cross-modal retrieval. Extensive experiments on four major benchmark datasets, MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrate the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
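
The link between evidence, Dirichlet parameters, belief masses $b_k$, and the uncertainty score $u$ can be illustrated with the standard Subjective Logic mapping: $\alpha_k = e_k + 1$, $S = \sum_k \alpha_k$, $b_k = e_k / S$, and $u = K / S$, so that $\sum_k b_k + u = 1$. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function name `evidential_uncertainty`, the temperature `tau`, and the choice of `exp` as the evidence function are hypothetical, and the evidence function used in the PAU code may differ.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def evidential_uncertainty(instance, prototypes, tau=0.5):
    """Map prototype-instance similarities to belief masses and uncertainty.

    instance:   embedding of one image/video/text (list of floats)
    prototypes: K prototype vectors of the same dimension
    tau:        hypothetical temperature (an assumption, not from the paper)
    Returns (beliefs, u) with sum(beliefs) + u == 1.
    """
    z = _normalize(instance)
    # Cosine similarity between the instance and each of the K prototypes.
    sims = [_dot(z, _normalize(p)) for p in prototypes]
    # Assumption: exp() turns similarities into non-negative evidence e_k.
    evidence = [math.exp(s / tau) for s in sims]
    # Standard Subjective Logic: Dirichlet parameters alpha_k = e_k + 1.
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)              # Dirichlet strength S
    k = len(prototypes)
    beliefs = [e / strength for e in evidence]   # b_k = e_k / S
    u = k / strength                             # u = K / S
    return beliefs, u

# Toy usage with K = 3 two-dimensional prototypes.
toy_prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_beliefs, toy_u = evidential_uncertainty([1.0, 0.0], toy_prototypes)
```

In this toy setup the instance is aligned with the first prototype, so the first belief mass is the largest, and the belief masses and uncertainty together sum to one.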

Paper Structure (24 sections, 26 equations, 9 figures, 11 tables)

Introduction
Related work
Cross-modal Retrieval
Uncertainty Quantification
Method
Task Definition
Uncertainty Quantification
Experiments
Datasets and Metrics
Implementation Details
Comparison with State of the Arts
Comparison with Probabilistic Model
Ablation Studies
Conclusion
The Model Structure of the Baselines
...and 9 more sections

Figures (9)

Figure 1: Illustration of confused matching in fast-paced videos and non-detailed texts. We assume the possible semantics of each modal subspace are finite, with $K$ categories. (a) The single-scene Video A can match only one semantics, "talking". By contrast, the multi-scene Video B can match three semantics: "talking", "shadow", and "cave". (b) Text A can match only the left video, while Text B, with some details removed (in red), matches both videos.
Figure 2: The framework of PAU. The visual encoder $\phi_v$ and textual encoder $\phi_t$ separately map the visual and textual instances into a joint embedding space to calculate the similarity matrix $\mathbf{M}$. A dot product is used to build a set of similarity vectors $\mathbf{P} \in \mathbb{R}^{N \times K}$ between $N$ instances and $K$ prototypes, which is then used to model the uncertainty. An uncertainty loss pushes the prototypes to learn the rich semantics of the subspace so that uncertainty is quantified accurately. In addition, a diversity loss is introduced to keep the prototypes diverse. $\otimes$ denotes cosine similarity.
Figure 3: Comparison of performance changes after removing the top-r instances with the highest uncertainty scores, as quantified by PCME and PAU, on MS-COCO. For a fair comparison, we apply the removal to predictions from both CLIP and PCME. (a) and (b) show the performance changes on CLIP predictions. (c) and (d) show the performance changes on PCME predictions. In i2t, text instances are removed. In t2i, image instances are removed.
Figure 4: Performance comparison against the number of removed data instances on MSR-VTT.
Figure 5: Performance comparison against the number of uncertain data instances on MSR-VTT.
...and 4 more figures
