Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements

Silvia Terragni, Hoang Cuong, Joachim Daiber, Pallavi Gudipati, Pablo N. Mendes

TL;DR

This paper assesses several LLMs and Multimodal Language Models in terms of their alignment with human judgments across multiple multimodal search scenarios and investigates the trade-offs between cost and accuracy.

Abstract

Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.

Paper Structure

This paper contains 27 sections, 6 tables.