Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024
Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz
TL;DR
The report presents the inaugural LLM4Eval workshop at SIGIR 2024, addressing the challenge of evaluating information retrieval in the era of large language models. It outlines the workshop organization, topics, and a program that combined keynotes, posters, and a panel to surface methodological and practical questions around LLM-based evaluation. The LLMJudge challenge provides a dataset-and-prompt framework to study label quality and agreement, highlighting issues of validity, randomness, and reproducibility in LLM-driven assessments, with implications for both academia and industry. Overall, the event signals growing interest and collaboration at the intersection of IR evaluation and generative AI, and it points to concrete directions for robust, scalable, and fair evaluation protocols using LLMs.
Abstract
The first edition of the workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) took place in July 2024, co-located with the ACM SIGIR Conference 2024 (SIGIR 2024) in the USA. The aim was to bring information retrieval researchers together around the topic of LLMs for evaluation in information retrieval, which has gathered attention with the advancement of large language models and generative AI. Given the novelty of the topic, the workshop focused on multi-sided discussions, namely panels and poster sessions for the accepted proceedings papers.
