PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models
Mayank Nautiyal, Stela Arranz Gheorghe, Kristiana Stefa, Li Ju, Ida-Maria Sintorn, Prashant Singh
TL;DR
PARIC addresses the ill-posedness and multivaluedness of cross-modal mappings in language-guided image classification by introducing probabilistic attention guided by language. It extends GALS by adding ProbVLM-based adapters that convert embeddings to $GGD(\widehat{\mathbf{z}}, \alpha, \beta)$, sample $K$ instantiations to derive a reference attention map $A_{\mathrm{ref}}(x)$, and regularize the classifier via $\mathcal{L}_{\mathrm{att}}$ together with $\mathcal{L}_{\mathrm{cls}}$ to train $f_{\theta}$ with attention $A_{\theta}(x)$. The method shows improved accuracy, reduced variance, and robustness to bias/noise across MS-COCO, Waterbirds, and Food-101 datasets, with two aggregation schemes (mean and median) for the attention maps. These results indicate uncertainty-aware multimodal guidance can improve interpretability, fairness, and generalization when leveraging large vision-language foundations like CLIP.
Abstract
Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.
