Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation

Kasra Hosseini; Thomas Kober; Josip Krapac; Roland Vollgraf; Weiwei Cheng; Ana Peleteiro Ramallo

TL;DR

This paper proposes a framework for assessing product search engines in a large-scale e-commerce setting. It uses Multimodal LLMs to generate annotation guidelines tailored to individual queries and to carry out the subsequent annotation task.

Abstract

Evaluating production-level retrieval systems at scale is a crucial yet challenging task because large pools of well-trained human annotators are rarely available. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks. In this paper, we propose a framework for assessing product search engines in a large-scale e-commerce setting, leveraging Multimodal LLMs for (i) generating tailored annotation guidelines for individual queries, and (ii) conducting the subsequent annotation task. Our method, validated through deployment on a large e-commerce platform, demonstrates comparable quality to human annotations, significantly reduces time and cost, facilitates rapid problem discovery, and provides an effective solution for production-level quality control at scale.
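
The two LLM roles named in the abstract, (i) guideline generation and (ii) annotation, can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the `LLM` callable type, the prompt wording, the two-label set (`relevant` / `irrelevant`), the `precision` helper, and the offline `stub_llm` are all assumptions introduced here so that the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Annotation:
    query: str
    product_id: str
    label: str  # assumed label set: "relevant" or "irrelevant"


def generate_guidelines(llm: LLM, query: str) -> str:
    """Step (i): ask the LLM for query-specific annotation guidelines."""
    return llm(f"Write annotation guidelines for the search query: {query!r}")


def annotate(llm: LLM, query: str, product_id: str,
             product_text: str, guidelines: str) -> Annotation:
    """Step (ii): label one query-product pair using the guidelines."""
    prompt = (
        f"Guidelines: {guidelines}\n"
        f"Query: {query}\n"
        f"Product: {product_text}"
    )
    return Annotation(query, product_id, llm(prompt).strip().lower())


def precision(annotations: List[Annotation]) -> float:
    """Fraction of retrieved products labelled relevant (assumed metric)."""
    return sum(a.label == "relevant" for a in annotations) / len(annotations)


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a real model; the rules are purely illustrative."""
    if prompt.startswith("Write annotation guidelines"):
        return "The product must be a sneaker and predominantly black."
    # Only inspect the product description, which comes last in the prompt.
    product_part = prompt.split("Product:")[-1].lower()
    return "relevant" if "black" in product_part else "irrelevant"


guidelines = generate_guidelines(stub_llm, "black sneakers")
results = [
    annotate(stub_llm, "black sneakers", "p1",
             "Black leather low-top sneakers", guidelines),
    annotate(stub_llm, "black sneakers", "p2",
             "White canvas sneakers", guidelines),
]
print([r.label for r in results], precision(results))
```

In a real deployment `stub_llm` would be replaced by a call to a multimodal model, and the product image or its textual description would be part of the annotator input.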

Paper Structure (13 sections, 8 figures, 1 table)

Introduction
Multimodal LLM-based relevance assessment
Experiments and Results
Dataset
LLM versus Human Annotators
Discussion
Conclusion
Ethics Statement
Multimodal LLM-powered relevance assessment: evaluation steps for an example query
Human Annotation Guidelines
Experiments with LLM types: GPT-3.5, GPT-4, and GPT-4o
LLM versus Human error types
Subjective Nature of Relevance Judgements

Figures (8)

Figure 1: Our proposed framework works by extracting a query-product pair from our search query-click logs (1). The query (e.g. black sneakers) is then passed on to the LLM generator (2). The LLM generator creates specific annotation instructions for the given query. The query-specific annotation guidelines and the query-product pair (e.g. black sneakers and the retrieved product) are provided as input to the LLM annotator (3). Lastly, the annotated query-product pair is forwarded to the search engine evaluation module (4).
Figure 2: Our proposed Multimodal LLM-powered framework enables offline evaluation of large-scale product retrieval systems and presents significant time and cost reductions compared to existing evaluation techniques. Refer to Fig. 1 for an overview of the main steps in the framework, and consult the text for further details. The orange rectangle indicates where a "one-step" Multimodal LLM (MLLM) could be utilised, instead of employing one MLLM to create a textual description for image inputs (Step 4) followed by an LLM (Step 5a). In the one-step MLLM, both textual descriptions and the product image are directly fed into the LLM annotator, along with query requirements and query-specific annotation guidelines. The depiction of the pipeline is simplified for readability.
Figure 3: Agreements between (M)LLM and the human annotator groups (i.e., A1 and A2). We compare agreements based on i) matching either A1 or A2 and ii) inter-annotator agreement between human annotators (A1 vs. A2) and between LLMs and the human majority vote. In the A1 or A2 column, we use the same human majority vote to measure the agreements for human annotators. Results are reported separately for English and German. For human annotations, we report the total time and cost. We use GPT-4o in all steps of our LLM annotation pipeline (Fig. 2). Refer to Table 1 for a more detailed comparison between human annotator groups (A1, A2, and tiebreaker) and different versions of our LLM-powered framework.
Figure 4: Distribution of errors between LLMs and humans on hard disagreements (50% were due to human errors, 31% LLM errors and in 19% both made an error). The upper part ("Both errors") focuses on errors that either the LLM or humans could make. It highlights that LLMs and humans make very different types of errors. In addition, the lower part ("LLM errors") shows the distribution of errors that only an LLM would make. Predominantly these are misunderstandings of a part of the search query.
Figure 5: Evaluation steps for an example query "women's long sleeve t-shirt with green stripes". The entire content displayed in this figure is generated by Multimodal LLMs, except for panel (a), the packshot in panel (d), and the black dashed rectangle also in panel (d). However, within the attributes shown in panel (d), the "visual description of packshot", highlighted by a red rectangle, is also generated by a vision model (specifically, GPT-4o was used in this instance). Please refer to the text for further details. (In this example, we have removed the brand name from the product description and the tag on the packshot.)
...and 3 more figures
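
The two agreement measures mentioned in the Figure 3 caption (matching either A1 or A2, and agreement with the human majority vote) can be illustrated with a small sketch. The exact definitions used in the paper are not given on this page, so the functions below are one plausible reading: the majority is assumed to be taken over A1, A2, and the tiebreaker, and all labels and data are made up for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority(labels: Sequence[str]) -> str:
    """Most common label among the voters for one item."""
    return Counter(labels).most_common(1)[0][0]


def match_either_rate(pred: List[str], a1: List[str], a2: List[str]) -> float:
    """Share of items where the prediction equals A1's or A2's label."""
    hits = sum(p in (x, y) for p, x, y in zip(pred, a1, a2))
    return hits / len(pred)


def agreement_rate(xs: List[str], ys: List[str]) -> float:
    """Share of items on which two label sequences agree exactly."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Toy labels for four query-product pairs (hypothetical data).
a1 = ["rel", "rel", "irr", "irr"]
a2 = ["rel", "irr", "irr", "rel"]
tiebreaker = ["rel", "irr", "irr", "rel"]
llm = ["rel", "rel", "rel", "rel"]

human_majority = [majority(votes) for votes in zip(a1, a2, tiebreaker)]

print(match_either_rate(llm, a1, a2))       # 0.75
print(agreement_rate(llm, human_majority))  # 0.5
print(agreement_rate(a1, a2))               # 0.5
```

With three voters and two possible labels a strict majority always exists, which is why the sketch does not handle ties.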
