Toward Automatic Relevance Judgment using Vision--Language Models for Image--Text Retrieval Evaluation

Jheng-Hong Yang, Jimmy Lin

TL;DR

The paper investigates automatic relevance judgments for image–text retrieval using Vision–Language Models (VLMs) in a zero-shot setting, comparing CLIP, LLaVA, and GPT-4V on the TREC-AToMiC 2023 collection. It frames relevance estimation as a pointwise score $\mathcal{F}(q,d)\in\mathbb{R}$, maps that score to graded judgments, and evaluates how well model-based qrels align with human judgments via Kendall's $\tau$, Spearman's $\rho_s$, Pearson's $\rho_p$, and Cohen's $\kappa$. Results show that LLM-powered VLMs outperform the CLIP-based baseline in ranking correlations (e.g., $\tau\approx0.4$ for $\mathrm{NDCG@10}$ and $\approx0.5$ for $\mathrm{MAP}$) and that GPT-4V produces the score distribution closest to human judgments ($\kappa\approx0.08$ vs. $-0.096$ for CLIP-S), yet evaluation bias toward CLIP-based systems remains a concern. The work demonstrates the potential of LLM-enhanced VLMs for scalable, automatic relevance judgments while highlighting biases and calibration challenges that call for further research and robust prompting strategies.
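To make the ranking-correlation measure concrete, here is a minimal sketch of Kendall's $\tau$ computed over per-run effectiveness scores obtained with two different sets of qrels. It assumes the simple tau-a variant (tied pairs count as neither concordant nor discordant), which may differ from the variant used in the paper. The function name `kendall_tau` and the NDCG@10 values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs contribute to neither count (an assumption of this sketch).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical mean NDCG@10 of four runs, scored with human-based
# and model-based qrels (illustrative numbers, not from the paper).
human_ndcg = [0.30, 0.25, 0.40, 0.10]
model_ndcg = [0.50, 0.55, 0.60, 0.20]

tau = kendall_tau(human_ndcg, model_ndcg)
print(f"Kendall's tau = {tau:.3f}")
```

In this toy input, five of the six run pairs are ordered the same way under both sets of qrels and one pair is swapped, so the two rankings agree on most but not all pairwise comparisons.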

Abstract

Vision--Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall's $\tau \sim 0.4$ when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore is strongly preferred, LLMs are less biased towards CLIP-based retrieval systems. (3) GPT-4V's score distribution aligns more closely with human judgments than other models, achieving a Cohen's $\kappa$ value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.

Paper Structure (18 sections, 1 equation, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Scatter plots of effectiveness (NDCG@10) for TREC-AToMiC 2023 runs using human-based and model-based qrels. Each data point represents the mean effectiveness of a single run evaluated with different qrels. CLIP-based runs are highlighted in red. Best viewed in color.
  • Figure 2: Cumulative distribution function (CDF) plot of relevance scores from various models. Human stands for relevance annotations of NIST qrels.
  • Figure 3: Confusion matrices comparing human-based and model-based qrels. Tick labels 0/1/2 represent Non-relevant/Related/Relevant graded levels. Best viewed in color.
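As a rough illustration of the comparison behind Figure 3, the sketch below builds a confusion matrix over the three graded levels (0/1/2) and computes an unweighted Cohen's $\kappa$ between human and model labels. Whether the paper uses the unweighted or a weighted form of $\kappa$ is not stated here, so the unweighted form is an assumption. The function names and label lists are hypothetical.

```python
from collections import Counter


def confusion_matrix(human, model, n_levels=3):
    """Rows index the human label, columns the model label."""
    matrix = [[0] * n_levels for _ in range(n_levels)]
    for h, m in zip(human, model):
        matrix[h][m] += 1
    return matrix


def cohen_kappa(human, model):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(human)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(h == m for h, m in zip(human, model)) / n
    # Chance agreement from the two marginal label distributions.
    h_counts, m_counts = Counter(human), Counter(model)
    p_e = sum(h_counts[k] * m_counts[k] for k in h_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical graded labels: 0 = Non-relevant, 1 = Related, 2 = Relevant.
human_labels = [0, 0, 1, 2, 2, 1]
model_labels = [0, 1, 1, 2, 0, 1]

matrix = confusion_matrix(human_labels, model_labels)
kappa = cohen_kappa(human_labels, model_labels)
print(matrix)
print(f"Cohen's kappa = {kappa:.3f}")
```

A $\kappa$ near zero means agreement is about what the label marginals alone would produce by chance, and a negative value means agreement is below that chance level.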