LLM-as-a-Judge: Toward World Models for Slate Recommendation Systems
Baptiste Bonin, Maxime Heuillet, Audrey Durand
TL;DR
The paper addresses offline evaluation for slate recommendation by proposing Large Language Models (LLMs) as world-model judges that compare pairs of slates given a short user history. It formalizes slate-level preferences, defines an empirical regret objective that quantifies alignment with user utility, and develops a prompt-design and bias-mitigation pipeline for eliciting reliable pairwise judgments from diverse LLM families. Across three tasks and multiple datasets, the authors show that LLM judges exhibit coherence properties that correlate with preference consistency, and that they generally outperform random baselines. These results point to LLM-based world models as potential domain-agnostic substitutes for traditional simulators in offline evaluation. The work also offers a transferable evaluation framework and guidance on when and how LLMs can reliably assess slate preferences, which is relevant to scalable, cross-domain recommender research.
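To make the pairwise-judgment idea concrete, here is a minimal Python sketch, not the authors' implementation. It queries a judge in both presentation orders (one common way to mitigate position bias) and scores it with a simple regret-style quantity. Every name in it (`Judge`, `debiased_preference`, `empirical_regret`, `overlap_judge`) and the toy utility values are hypothetical, and the regret formula is an assumption made for illustration; the paper's own definition may differ.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Slate = Tuple[str, ...]
# Hypothetical judge interface: given a user history and two slates,
# return "A" if the first slate is preferred, "B" otherwise.
# In practice this would wrap a prompted LLM call.
Judge = Callable[[Sequence[str], Slate, Slate], str]
Comparison = Tuple[Sequence[str], Slate, Slate, float, float]


def debiased_preference(judge: Judge, history: Sequence[str],
                        slate_a: Slate, slate_b: Slate) -> Optional[int]:
    """Ask the judge in both orders.

    Returns 0 if slate_a wins in both orders, 1 if slate_b wins in both,
    and None if the verdict flips with presentation order.
    """
    first = judge(history, slate_a, slate_b)
    second = judge(history, slate_b, slate_a)  # same pair, swapped positions
    if first == "A" and second == "B":
        return 0
    if first == "B" and second == "A":
        return 1
    return None


def empirical_regret(judge: Judge, comparisons: List[Comparison]) -> float:
    """Mean utility lost by following the judge (assumed definition).

    Each comparison is (history, slate_a, slate_b, utility_a, utility_b).
    An order-inconsistent verdict is treated as a coin flip and therefore
    costs half of the utility gap in expectation.
    """
    if not comparisons:
        raise ValueError("comparisons must be non-empty")
    total = 0.0
    for history, slate_a, slate_b, u_a, u_b in comparisons:
        gap = abs(u_a - u_b)
        best = 0 if u_a >= u_b else 1
        choice = debiased_preference(judge, history, slate_a, slate_b)
        if choice is None:
            total += 0.5 * gap
        elif choice != best:
            total += gap
    return total / len(comparisons)


def overlap_judge(history: Sequence[str], first: Slate, second: Slate) -> str:
    """Toy stand-in for an LLM: prefers the slate sharing more items with
    the history, and falls back to the first position on ties."""
    seen = set(history)
    score_first = sum(item in seen for item in first)
    score_second = sum(item in seen for item in second)
    return "A" if score_first >= score_second else "B"


# Toy data with made-up utilities, for illustration only.
toy_comparisons: List[Comparison] = [
    (["x", "y"], ("x", "y"), ("z", "w"), 1.0, 0.0),
    (["x"], ("p", "q"), ("r", "s"), 0.25, 0.75),
]
regret = empirical_regret(overlap_judge, toy_comparisons)
```

In this toy setup the second comparison is a tie for the judge, so its answer depends only on position and the pair is flagged as inconsistent rather than counted as a real preference. Whether to discard such pairs, penalize them, or treat them as coin flips is a design choice, and the sketch picks the last option arbitrarily.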
Abstract
Modeling user preferences across domains remains a key challenge in slate recommendation research, where a slate is an ordered sequence of recommended items. We investigate how Large Language Models (LLMs) can effectively act as world models of user preferences through pairwise reasoning over slates. We conduct an empirical study involving several LLMs on three tasks spanning different datasets. Our results reveal relationships between task performance and properties of the preference function captured by LLMs, pointing to areas for improvement and highlighting the potential of LLMs as world models in recommender systems.
