MASS: Overcoming Language Bias in Image-Text Matching

Jiwan Chung, Seungwon Lim, Sangkyu Lee, Youngjae Yu

TL;DR

This paper tackles language bias in image-text matching by introducing MASS, a training-free score based on pointwise mutual information (PMI) that debiases token-level likelihoods from pretrained visual-language models. For an image $\mathbf{c}$ and a caption $\mathbf{x}$ of $l$ tokens, MASS computes $S_{\text{MASS}}(\mathbf{c},\mathbf{x}) = \frac{1}{l} \sum_{t=1}^{l} \log \frac{p_{\bar{\theta}}(x_t|x_{<t},\mathbf{c})}{p_{\bar{\theta}}(x_t|x_{<t})}$ and estimates the marginal term in the denominator by conditioning on a null image $\mathbf{c}_{\emptyset}$, which reduces the influence of linguistic priors while preserving linguistic structure. The authors demonstrate MASS’s effectiveness on color, number, and gender debiasing tasks, and show that it improves results on linguistic complexity benchmarks such as Winoground and SVO-Probes without retraining. Because it needs no additional training, the score can be applied directly to off-the-shelf visual-language systems to make image-text retrieval more robust and less biased.
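The score above can be illustrated with a minimal Python sketch. It assumes that per-token log-probabilities are already available from some captioning model, once conditioned on the real image and once on a null image; the function names `token_likelihood` and `mass_score` and all numeric values are hypothetical and chosen only to show the mechanics, not taken from the paper.

```python
from typing import Sequence


def token_likelihood(cond_logprobs: Sequence[float]) -> float:
    """Length-normalized token likelihood (TL): mean of log p(x_t | x_<t, c)."""
    if not cond_logprobs:
        raise ValueError("caption must contain at least one token")
    return sum(cond_logprobs) / len(cond_logprobs)


def mass_score(cond_logprobs: Sequence[float],
               null_logprobs: Sequence[float]) -> float:
    """Mean per-token log-ratio between image-conditioned and null-image likelihoods.

    cond_logprobs[t] stands for log p(x_t | x_<t, c) with the real image c.
    null_logprobs[t] stands for log p(x_t | x_<t, c_null), used here as the
    estimate of the text-only marginal log p(x_t | x_<t).
    """
    if len(cond_logprobs) != len(null_logprobs):
        raise ValueError("both sequences must cover the same tokens")
    if not cond_logprobs:
        raise ValueError("caption must contain at least one token")
    diffs = [c - n for c, n in zip(cond_logprobs, null_logprobs)]
    return sum(diffs) / len(diffs)


# Made-up log-probabilities for two four-token captions of the same image.
# Caption A contains a token the language prior already favors; caption B
# contains a token that is unlikely without the image but supported by it.
caption_a_cond = [-1.0, -0.5, -0.7, -0.6]
caption_a_null = [-1.0, -0.6, -0.8, -0.7]
caption_b_cond = [-1.0, -1.2, -0.7, -0.6]
caption_b_null = [-1.0, -2.5, -0.8, -0.7]

tl_a = token_likelihood(caption_a_cond)
tl_b = token_likelihood(caption_b_cond)
mass_a = mass_score(caption_a_cond, caption_a_null)
mass_b = mass_score(caption_b_cond, caption_b_null)

print(f"TL:   A={tl_a:.3f}  B={tl_b:.3f}")
print(f"MASS: A={mass_a:.3f}  B={mass_b:.3f}")
```

With these illustrative numbers, plain token likelihood ranks caption A higher, while the log-ratio ranks caption B higher, because B's second token gains far more from seeing the image than A's does. Subtracting in log space is equivalent to taking the log of the probability ratio in the formula.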

Abstract

Pretrained visual-language models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.

Paper Structure (21 sections, 9 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Captions retrieved with each method given the image, where only MASS succeeds in ruling out the failure modes. Models trained with Image-Text Contrastive (ITC) objectives, such as CLIP (Radford et al., 2021), fail to model linguistic structure. The Token Likelihood (TL) of image captioning models, including OFA (Wang et al., 2022), shows overreliance on the language prior. MASS corrects the language bias of image captioning models, enabling accurate image-text matching.
  • Figure 2: (a) Given an image of a girl playing tennis, the visual-language model falsely retrieves captions describing the subject as male by relying on language bias. (b) MASS, in contrast, reduces such gender bias by adopting pointwise mutual information, which suppresses the text-only marginal likelihood. The corresponding experimental results are provided in the gender debiasing experiment.
  • Figure 3: Data samples and experimental results from our color debiasing experiment.
  • Figure 4: The Pareto frontier of recall-bias trade-off in COCO-captions. Y-axis (gender bias) is inverted for better visualization.
  • Figure 5: Top 5 COCO Captions image-to-text retrieval results, sorted by decreasing retrieval score. The token likelihood score (TL) avoids associating female keywords with sports activities: it refuses to retrieve captions with clearly correct gender information in (a), prefers to classify the skier as a man when the visual features possibly indicate otherwise in (b), and hallucinates the gender of a rider who is not visible in the image in (c). MASS reduces such gender bias in its retrieved captions.
  • ...and 2 more figures