PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vision Language Models

Mayank Nautiyal, Stela Arranz Gheorghe, Kristiana Stefa, Li Ju, Ida-Maria Sintorn, Prashant Singh

TL;DR

PARIC addresses the ill-posedness and multivaluedness of cross-modal mappings in language-guided image classification by introducing probabilistic attention guided by language. It extends GALS with ProbVLM-based adapters that model each embedding as a generalized Gaussian distribution $\mathrm{GGD}(\widehat{\mathbf{z}}, \alpha, \beta)$, draw $K$ samples from it to derive a reference attention map $A_{\mathrm{ref}}(x)$, and regularize the classifier $f_{\theta}$, whose attention is $A_{\theta}(x)$, by training with $\mathcal{L}_{\mathrm{att}}$ together with $\mathcal{L}_{\mathrm{cls}}$. The method shows improved accuracy, reduced variance, and robustness to bias and noise across the MS-COCO, Waterbirds, and Food-101 datasets, with two aggregation schemes (mean and median) for the attention maps. These results indicate that uncertainty-aware multimodal guidance can improve interpretability, fairness, and generalization when leveraging large vision-language foundation models such as CLIP.
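
To make the sampling step concrete, the following is a minimal NumPy sketch of drawing $K$ samples from an element-wise generalized Gaussian distribution. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_ggd`, the per-dimension density proportional to $\exp(-(|x - \widehat{z}| / \alpha)^{\beta})$, and the gamma-based sampler are choices made here, and in PARIC the parameters would come from the trained adapters rather than being set by hand.

```python
import numpy as np


def sample_ggd(mu, alpha, beta, num_samples, rng):
    """Draw `num_samples` vectors from an element-wise generalized Gaussian.

    Assumed density per dimension: p(x) ∝ exp(-(|x - mu| / alpha) ** beta).
    `mu`, `alpha`, and `beta` may be scalars or arrays that broadcast to mu's shape.
    """
    mu = np.asarray(mu, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    shape = (num_samples,) + mu.shape

    # If G ~ Gamma(shape=1/beta, scale=1), then alpha * G**(1/beta) follows the
    # distribution of |x - mu| under the density above.
    g = rng.gamma(shape=1.0 / beta, scale=1.0, size=shape)
    # Attach a random sign so the samples are symmetric around mu.
    signs = rng.choice([-1.0, 1.0], size=shape)
    return mu + signs * alpha * g ** (1.0 / beta)


# Hypothetical usage: K = 50 samples around a 4-dimensional embedding.
rng = np.random.default_rng(0)
z_hat = np.zeros(4)
samples = sample_ggd(z_hat, alpha=1.0, beta=2.0, num_samples=50, rng=rng)
print(samples.shape)  # (50, 4)
```

With $\beta = 2$ this density reduces to a Gaussian with variance $\alpha^2 / 2$, which gives a quick sanity check for the sampler.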

Abstract

Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.

Paper Structure

This paper contains 22 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: PARIC Workflow. The classifier $f_{\theta}$ predicts the label $\hat{y}$ and generates an attention map $A_{\theta}(x)$ for input image $x$. The ProbVLM pipeline expects image-text pairs, so the label $y$ is first converted into text prompts. These text prompts, along with the image, $x$, are processed by frozen CLIP encoders $\Psi_T$ (text) and $\Psi_I$ (image) to produce deterministic embeddings $\mathbf{z}_T$ and $\mathbf{z}_I$. Trainable adapters $\Omega_T$ and $\Omega_I$ model these embeddings as Generalized Gaussian Distributions (GGDs), from which $K$ samples are drawn to compute similarity scores. Grad-CAM combines these scores with CLIP's image feature map $\mathcal{F}$ to generate $K$ saliency maps, which are aggregated using mean or median into a reference map $A_{\mathrm{ref}}(x)$. This map guides $A_{\theta}(x)$ via $\mathcal{L}_{\mathrm{att}}$, complementing $\mathcal{L}_{\mathrm{cls}}$ to improve the robustness and interpretability of the classifier.
  • Figure 2: Attention map visualization for three instances: COCO and Waterbirds 100% (GALS vs. PARIC Mean) and Food-101 (GALS vs. PARIC Median). Each row shows the original image, the attention map from frozen CLIP, and the refined map after integrating probabilistic layers. The first two instances show the strength of the probabilistic approach, where the attention maps are more accurate; the Food-101 instance shows a case where PARIC performs worse because the regularization is too strong and limiting.
  • Figure 3: Comparison of Effective vs. Poor Attention Maps on the Waterbirds 100% Dataset. Each subfigure presents three instances from the dataset: the first image is the original input, and the third image is the final attention map obtained after integrating probabilistic layers. In (a), the first two rows use PARIC Mean and the third uses PARIC Median, whereas in (b), the first row uses PARIC Mean and the subsequent rows use PARIC Median.
  • Figure 4: Attention maps for MS-COCO. In each row, the first image is the original input, followed by the attention map from the frozen CLIP model. The third image shows the attention map obtained after integrating probabilistic layers and sampling 50 embeddings, using the mean aggregation method.
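
The aggregation step described in the Figure 1 caption, where $K$ saliency maps are combined into a single reference map $A_{\mathrm{ref}}(x)$ by mean or median, can be sketched as follows. This is a hedged illustration only: the function name `aggregate_saliency` and the toy maps are hypothetical, and the Grad-CAM computation that would produce the real maps is omitted.

```python
import numpy as np


def aggregate_saliency(maps, method="mean"):
    """Combine K saliency maps of shape (K, H, W) into one (H, W) reference map."""
    maps = np.asarray(maps, dtype=float)
    if method == "mean":
        return maps.mean(axis=0)
    if method == "median":
        return np.median(maps, axis=0)
    raise ValueError(f"unknown aggregation method: {method!r}")


# Toy example (fabricated values for illustration): four of the K = 5 maps
# agree on 0.2, and one outlier map is saturated at 1.0.
maps = np.full((5, 2, 2), 0.2)
maps[4] = 1.0

a_ref_mean = aggregate_saliency(maps, "mean")      # every pixel is about 0.36
a_ref_median = aggregate_saliency(maps, "median")  # every pixel is about 0.2
```

In this toy case the mean is pulled toward the outlier sample while the median ignores it, which is one reason a pipeline might offer both schemes.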