Evaluating Perspectival Biases in Cross-Modal Retrieval

Teerapol Saengsukhiran, Peerawat Chomphooyod, Narabodee Rodjananant, Chompakorn Chaksangchaichot, Patawee Prakrankamanant, Witthawin Sripheanpol, Pak Lovichit, Sarana Nutanong, Ekapol Chuangsuwanich

TL;DR

The paper tackles perspectival biases in cross-modal retrieval, identifying two distinct forms: prevalence bias in image-to-text retrieval and association bias in text-to-image retrieval. It introduces DLBKL, a rank-aware extension of LBKL, and the 3XCM benchmark with the Self-Preference Cultural Bias Score (SP) to quantify these biases. Through experiments across diverse model families (dense vision-language retrievers, cross-lingual alignments, and multilingual LLM-based embedders), the study shows that explicit cross-lingual alignment markedly reduces both biases, although association bias remains challenging. The work provides practical metrics and datasets for evaluating fairness in multilingual multimodal systems and calls for training strategies that enforce global semantic mappings rather than relying on data scale alone.

Abstract

Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.

Paper Structure

This paper contains 33 sections, 4 equations, 19 figures, 11 tables.

Figures (19)

  • Figure 1: Two Forms of Perspectival Biases. (a) Prevalence bias: an image query favors high-resource languages; the retrieval model ranks English results above semantically equivalent Japanese and Thai captions. (b) Association bias: a visualization of a model's embedding space, showing how a Japanese text query for "necklace" retrieves culturally proximate images (Japanese masks) instead of the semantically correct one (Kenyan necklace).
  • Figure 2: Overview of the study. First, in RQ1, we identify language prevalence in image-to-text retrieval by analyzing the languages of the retrieved texts and comparing the shares of high-, medium-, and low-resource languages. Second, in RQ2, we identify association bias through the model's self-preference behavior: each text query retrieves from three image candidates, one semantically relevant, one culturally relevant, and one non-relevant.
  • Figure 3: Illustration of how DLBKL, unlike the rank-agnostic LBKL, assigns a higher bias score to lists where high-resource languages dominate the top ranks.
  • Figure 4: Overview of the 3XCM dataset creation process, designed to produce a benchmark with parallelism across semantics, cultures, and languages.
  • Figure 5: Illustration of association bias evaluation. A Thai text query for "food" is evaluated against three candidates designed to isolate semantic faithfulness vs. cultural relevance.
  • ...and 14 more figures
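
The contrast in the Figure 3 caption, between a rank-agnostic and a rank-aware bias score, can be illustrated with a small sketch. This page does not give the definitions of LBKL or DLBKL, so everything below is an assumption made for illustration: the score is taken to be the KL divergence between the language distribution of a retrieved list and a uniform distribution, and the rank-aware variant applies a DCG-style logarithmic discount. The function names and the example lists are hypothetical, and the paper's actual formulas may differ.

```python
import math
from collections import defaultdict


def language_distribution(ranked_langs, weights):
    """Weighted share of each language in a ranked result list."""
    mass = defaultdict(float)
    for lang, weight in zip(ranked_langs, weights):
        mass[lang] += weight
    total = sum(mass.values())
    return {lang: m / total for lang, m in mass.items()}


def kl_from_uniform(dist, languages):
    """KL(dist || uniform over `languages`); 0 when languages are balanced."""
    uniform = 1.0 / len(languages)
    return sum(p * math.log(p / uniform) for p in dist.values() if p > 0)


def rank_agnostic_score(ranked_langs, languages):
    # Every position counts equally, so the order of results is ignored.
    weights = [1.0] * len(ranked_langs)
    return kl_from_uniform(language_distribution(ranked_langs, weights), languages)


def rank_discounted_score(ranked_langs, languages):
    # ASSUMPTION: DCG-style discount 1 / log2(rank + 1), chosen for
    # illustration only; not taken from the paper.
    weights = [1.0 / math.log2(rank + 1) for rank in range(1, len(ranked_langs) + 1)]
    return kl_from_uniform(language_distribution(ranked_langs, weights), languages)


languages = ["en", "ja", "th"]
# Same language counts in both lists; only the ordering differs.
top_heavy = ["en", "en", "ja", "ja", "th", "th"]
interleaved = ["en", "ja", "th", "th", "ja", "en"]

for name, ranking in [("top_heavy", top_heavy), ("interleaved", interleaved)]:
    print(
        name,
        round(rank_agnostic_score(ranking, languages), 4),
        round(rank_discounted_score(ranking, languages), 4),
    )
```

Both lists contain two results per language, so the rank-agnostic score cannot tell them apart. The discounted score is larger for the list in which English occupies the top ranks, which is the qualitative behavior the caption attributes to DLBKL.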