MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung, Seungwon Lim, Sangkyu Lee, Youngjae Yu
TL;DR
This paper tackles language bias in image-text matching by introducing MASS, a training-free score based on pointwise mutual information (PMI) that debiases token-level likelihoods from pretrained visual-language models. MASS computes $S_{\text{MASS}}(\mathbf{c},\mathbf{x}) = \frac{1}{l} \sum_{t=1}^{l} \log \frac{p_{\bar{\theta}}(x_t|x_{<t},\mathbf{c})}{p_{\bar{\theta}}(x_t|x_{<t})}$, where $\mathbf{c}$ is the image and $\mathbf{x}$ is a caption of $l$ tokens, and estimates the marginal term in the denominator by conditioning on a null image $\mathbf{c}_{\emptyset}$. This reduces reliance on linguistic priors while preserving linguistic structure. The authors demonstrate MASS’s effectiveness on color, number, and gender debiasing tasks, and show that it improves performance on linguistically complex benchmarks such as Winoground and SVO-Probes without retraining. The approach makes image-text retrieval more robust and fair, with practical impact for off-the-shelf visual-language systems.
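As a rough illustration of the score above, the following minimal sketch computes $S_{\text{MASS}}$ from per-token log-probabilities that are assumed to have been obtained already: one list conditioned on the real image and one conditioned on the null image. The function names `mass_score` and `rank_by_mass` and the toy probabilities are hypothetical and chosen for illustration only; they are not taken from the paper or its code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mass_score(logp_image: Sequence[float], logp_null: Sequence[float]) -> float:
    """Mean per-token log-ratio log p(x_t | x_<t, c) - log p(x_t | x_<t, c_null).

    Both inputs are assumed to hold log-probabilities of the same caption
    tokens, in the same order, from the same frozen model.
    """
    if len(logp_image) != len(logp_null):
        raise ValueError("both sequences must score the same tokens")
    if not logp_image:
        raise ValueError("caption must contain at least one token")
    total = sum(li - ln for li, ln in zip(logp_image, logp_null))
    return total / len(logp_image)


def rank_by_mass(
    candidates: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> List[str]:
    """Sort candidate captions from highest to lowest MASS score."""
    return sorted(candidates, key=lambda k: mass_score(*candidates[k]), reverse=True)


# Made-up numbers for an image of a green banana. The language prior
# strongly prefers "yellow", so the raw likelihood favors the wrong caption.
toy = {
    "a yellow banana": (
        [math.log(0.9), math.log(0.80), math.log(0.9)],  # with image
        [math.log(0.9), math.log(0.70), math.log(0.9)],  # with null image
    ),
    "a green banana": (
        [math.log(0.9), math.log(0.15), math.log(0.9)],  # with image
        [math.log(0.9), math.log(0.03), math.log(0.9)],  # with null image
    ),
}

if __name__ == "__main__":
    for caption in rank_by_mass(toy):
        print(caption, round(mass_score(*toy[caption]), 4))
```

In this toy case the raw image-conditioned likelihood prefers "a yellow banana", while the debiased score prefers "a green banana", because seeing the image raises the probability of "green" far more than it raises the probability of "yellow".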
Abstract
Pretrained visual-language models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.
