LLM-as-a-Judge: Toward World Models for Slate Recommendation Systems

Baptiste Bonin, Maxime Heuillet, Audrey Durand

TL;DR

The paper tackles the challenge of offline evaluation for slate recommendation by proposing Large Language Models as world-model judges that perform pairwise slate comparisons based on short user histories. It formalizes slate-level preferences, defines an empirical regret objective to quantify alignment with user utility, and develops a prompt-design plus bias-mitigation pipeline to reliably elicit pairwise judgments from diverse LLM families. Across three tasks and multiple datasets, the authors demonstrate that LLM judges exhibit coherence properties that correlate with preference consistency and generally outperform random baselines, highlighting the potential of LLM-based world models as domain-agnostic substitutes for traditional simulators in offline evaluation. The work provides a transferable evaluation framework and actionable insights into when and how LLMs can reliably assess slate preferences, with implications for scalable, cross-domain recommender research.

Abstract

Modeling user preferences across domains remains a key challenge in slate recommendation research (i.e., recommending an ordered sequence of items). We investigate how Large Language Models (LLMs) can effectively act as world models of user preferences through pairwise reasoning over slates. We conduct an empirical study involving several LLMs on three tasks spanning different datasets. Our results reveal relationships between task performance and properties of the preference function captured by LLMs, pointing to areas for improvement and highlighting the potential of LLMs as world models in recommender systems.

Paper Structure

This paper contains 26 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Distribution of empirical regret across models for each dataset/task, with dataset similarity.
  • Figure 2: Empirical regret against axioms of coherence for each model in each dataset/task.
  • Figure 3: Overview of the LLM-based evaluation pipeline. Each user history produces candidate slates that are compared pairwise by a language model acting as a world model judge. The model receives a structured prompt containing user context and both slates, then outputs a pairwise preference. These pairwise outcomes are aggregated into coherence metrics that validate transitivity, asymmetry, and rational consistency.
  • Figure 4: Coherence metrics across all tasks and models. Higher scores indicate stronger consistency with preference axioms.
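
The Figure 3 caption describes aggregating pairwise judgments into coherence metrics such as asymmetry and transitivity. The following is a minimal sketch of how such an aggregation might look, not the authors' implementation: the function names (`asymmetry_rate`, `transitivity_rate`), the boolean `judge(a, b)` interface, and the toy judges are all assumptions made for illustration, and the paper's exact metric definitions may differ.

```python
from itertools import combinations, permutations
from typing import Callable, Sequence, Tuple

Slate = Tuple  # hypothetical representation: an ordered tuple of item identifiers
Judge = Callable[[Slate, Slate], bool]  # assumed interface: True if the first slate is preferred


def asymmetry_rate(slates: Sequence[Slate], judge: Judge) -> float:
    """Fraction of unordered pairs where exactly one direction is preferred.

    A judge that always picks the first-presented slate (position bias)
    scores 0.0 here, because swapping the order flips its answer.
    """
    pairs = list(combinations(slates, 2))
    if not pairs:
        return 1.0
    consistent = sum(1 for a, b in pairs if judge(a, b) != judge(b, a))
    return consistent / len(pairs)


def transitivity_rate(slates: Sequence[Slate], judge: Judge) -> float:
    """Among ordered triples with a > b and b > c, fraction where a > c also holds."""
    satisfied = 0
    total = 0
    for a, b, c in permutations(slates, 3):
        if judge(a, b) and judge(b, c):
            total += 1
            satisfied += int(judge(a, c))
    return satisfied / total if total else 1.0


# Toy stand-in for an LLM judge: prefers the slate with the larger item sum.
def sum_judge(a: Slate, b: Slate) -> bool:
    return sum(a) > sum(b)


# Toy judge with a preference cycle (rock-paper-scissors style).
_BEATS = {(("r",), ("s",)), (("s",), ("p",)), (("p",), ("r",))}


def cyclic_judge(a: Slate, b: Slate) -> bool:
    return (a, b) in _BEATS


if __name__ == "__main__":
    numeric_slates = [(1, 2), (3, 4), (5, 6)]
    print(asymmetry_rate(numeric_slates, sum_judge))
    print(transitivity_rate(numeric_slates, sum_judge))
    print(transitivity_rate([("r",), ("p",), ("s",)], cyclic_judge))
```

In practice the `judge` callable would wrap a prompted LLM call; querying each pair in both presentation orders is what lets the asymmetry check expose position bias.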