Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images

Shicheng Xu, Danyang Hou, Liang Pang, Jingcheng Deng, Jun Xu, Huawei Shen, Xueqi Cheng

TL;DR

The paper studies how AI-generated images induce an invisible relevance bias in text–image retrieval: models rank generated images higher than real images with similar semantics. It builds a benchmark pairing real and AI-generated images with semantically aligned prompts and develops a merged-caption, oversampling generation pipeline to simulate realistic mixed corpora. The bias is pervasive across architectures and training regimes and worsens when AI-generated images appear in the training data, creating a vicious cycle. A debiasing method based on a caption–image contrastive objective and controllable sampling reduces the bias. Applying it also reveals that AI-generated images lead the image encoder to embed additional information into their representations; this effect can be reversed by shifting the representations, which offers a way to diagnose and mitigate the bias.

Abstract

With the advancement of generation models, AI-generated content (AIGC) is becoming more realistic, flooding the Internet. A recent study suggests that this phenomenon causes source bias in text retrieval for web search. Specifically, neural retrieval models tend to rank generated texts higher than human-written texts. In this paper, we extend the study of this bias to cross-modal retrieval. Firstly, we successfully construct a suitable benchmark to explore the existence of the bias. Subsequent extensive experiments on this benchmark reveal that AI-generated images introduce an invisible relevance bias to text-image retrieval models. Specifically, our experiments show that text-image retrieval models tend to rank the AI-generated images higher than the real images, even though the AI-generated images do not exhibit more visually relevant features to the query than real images. This invisible relevance bias is prevalent across retrieval models with varying training data and architectures. Furthermore, our subsequent exploration reveals that the inclusion of AI-generated images in the training data of the retrieval models exacerbates the invisible relevance bias. The above phenomenon triggers a vicious cycle, which makes the invisible relevance bias become more and more serious. To elucidate the potential causes of invisible relevance and address the aforementioned issues, we introduce an effective training method aimed at alleviating the invisible relevance bias. Subsequently, we apply our proposed debiasing method to retroactively identify the causes of invisible relevance, revealing that the AI-generated images induce the image encoder to embed additional information into their representation. This information exhibits a certain consistency across generated images with different semantics and can make the retriever estimate a higher relevance score.
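
As a rough illustration of the ranking claim in the abstract, the sketch below checks how often a retriever scores an AI-generated image above its paired real image for the same query. It is only a minimal sketch under stated assumptions, not the paper's evaluation protocol or metric: the embeddings are assumed to come from some text-image retriever, cosine similarity is assumed as the relevance score, and the names `cosine` and `generated_preference_rate` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def generated_preference_rate(queries, real_images, generated_images):
    """Fraction of queries whose generated image outscores the paired real image.

    Assumption: the i-th query, real image, and generated image form one
    semantically aligned triple, and all inputs are embedding vectors.
    """
    wins = sum(
        cosine(q, g) > cosine(q, r)
        for q, r, g in zip(queries, real_images, generated_images)
    )
    return wins / len(queries)


# Toy embeddings (synthetic, not real model outputs): both image sources are
# noisy copies of the query, so neither source is favoured by construction.
random.seed(0)
dim, n = 16, 1000
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
real = [[x + random.gauss(0, 1) for x in q] for q in queries]
generated = [[x + random.gauss(0, 1) for x in q] for q in queries]

rate = generated_preference_rate(queries, real, generated)
print(f"generated image preferred in {rate:.1%} of pairs")
```

On unbiased scores such as this toy data, the rate should sit near 50%. A rate well above that on real retriever embeddings, for pairs judged semantically equivalent, would be consistent with the bias the abstract describes.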

Paper Structure

This paper contains 22 sections, 9 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Bias found in our paper. IR models tend to rank AI-generated images higher than real images even though the two have very similar semantics. This bias makes generated images more likely to be surfaced from the massive data on the Internet, and therefore more likely to be mixed into the training data of AIGC and retrieval models, which leads to more serious bias and forms a vicious cycle.
  • Figure 2: Assessment results on the training set mixed with AI-generated images. We vary the ratio of AI-generated images in the datasets (x-axis) while keeping the total number of training samples unchanged. The model is tested on the test sets of Flickr30k+AI (in-domain) and MSCOCO+AI (out-of-domain), which we constructed in the benchmark section.
  • Figure 3: Distribution of the caption-image relevance scores estimated by retrieval models trained on datasets mixed with different ratios of AI-generated images. Flickr30k is in-domain and MSCOCO is out-of-domain.
  • Figure 4: Distribution of the caption-image relevance scores estimated by retrieval models with different sampling probabilities $\beta$ in our debiasing method. Flickr30k is in-domain and MSCOCO is out-of-domain.
  • Figure 5: t-SNE visualization of image representations and the transformation vector $\boldsymbol{p}$.