Mitigating Preference Leakage via Strict Estimator Separation for Normative Generative Ranking

Dalia Nahhas, Xiaohao Cai, Imran Razzak, Shoaib Jameel

TL;DR

The results demonstrate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation, proving that subtle cultural preferences can be distilled into efficient rankers without leakage.

Abstract

In Generative Information Retrieval (GenIR), the bottleneck has shifted from generation to the selection of candidates, particularly for normative criteria such as cultural relevance. Current LLM-as-a-Judge evaluations often suffer from circularity and preference leakage, where overlapping supervision and evaluation models inflate performance. We address this by formalising cultural relevance as a within-query ranking task and introducing a leakage-free two-judge framework that strictly separates supervision (Judge B) from evaluation (Judge A). On a new benchmark of 33,052 culturally grounded stories (NGR-33k), we find that while classical baselines yield only modest gains, a dense bi-encoder distilled from a Judge-B-supervised Cross-Encoder is highly effective. Although the Cross-Encoder provides a strong supervision signal for distillation, the distilled BGE-M3 model substantially outperforms it under leakage-free Judge A evaluation. We validate our framework on the human-curated Moral Stories dataset, showing strong alignment with human norms. Our results demonstrate that rigorous evaluator separation is a prerequisite for credible GenIR evaluation, proving that subtle cultural preferences can be distilled into efficient rankers without leakage.

Paper Structure (12 sections, 7 equations, 2 figures, 9 tables)


Figures (2)

  • Figure 1: An example of a query where dense bi-encoders and lexical models can struggle. They rank the stereotypical Candidate A higher due to token overlap, whereas the true cultural ground truth (Candidate B) requires reasoning about the enactment of the moral value.
  • Figure 2: Our framework separates the optimisation loop (top, blue) from the inference/evaluation loop (bottom, red). The Supervision Judge ($J_B$) generates noisy proxy labels $y_B$ for gradient descent ($\nabla_\theta$). The Independent Evaluator ($J_A$) provides labels $y_A$ exclusively for metric computation. The only bridge between the two loops is the frozen parameter set $\theta^*$, which prevents circular preference leakage. $\hat{y} = f_{\theta^*}(s, q)$ is the framework's prediction.
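
The separation described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the paper's implementation: the toy data, the linear scorer standing in for $f_\theta$, the squared-error loss, and the pairwise-agreement metric are all assumptions chosen for brevity. The one point it aims to show is structural: the training function reads only $y_B$, the evaluation function reads only $y_A$, and the frozen $\theta^*$ is the sole value passed between them.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data (not from the paper): each query maps to candidates
# of the form (feature vector, y_B, y_A). y_B is the supervision judge's
# label, y_A is the independent evaluator's label.
Candidate = Tuple[List[float], float, float]

DATA: Dict[str, List[Candidate]] = {
    "q1": [([1.0, 0.2], 0.9, 1.0), ([0.1, 0.9], 0.2, 0.0), ([0.6, 0.5], 0.5, 0.6)],
    "q2": [([0.9, 0.1], 0.8, 0.9), ([0.2, 0.8], 0.3, 0.1)],
}


def score(theta, x):
    """Linear scorer standing in for f_theta(s, q) (an assumption)."""
    return sum(t * v for t, v in zip(theta, x))


def train_on_judge_b(data, lr=0.1, epochs=200):
    """Optimisation loop: stochastic gradient descent against y_B only."""
    theta = [0.0, 0.0]
    for _ in range(epochs):
        for candidates in data.values():
            for x, y_b, _y_a in candidates:  # y_A is deliberately never read
                err = score(theta, x) - y_b
                theta = [t - lr * err * v for t, v in zip(theta, x)]
    return tuple(theta)  # immutable tuple plays the role of frozen theta*


def pairwise_accuracy_judge_a(theta_star, data):
    """Evaluation loop: within-query pairwise agreement with y_A only."""
    agree = total = 0
    for candidates in data.values():
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                x_i, _, ya_i = candidates[i]
                x_j, _, ya_j = candidates[j]
                if ya_i == ya_j:
                    continue  # ties carry no ordering information
                total += 1
                predicted_gap = score(theta_star, x_i) - score(theta_star, x_j)
                if predicted_gap * (ya_i - ya_j) > 0:
                    agree += 1
    return agree / total if total else 0.0


theta_star = train_on_judge_b(DATA)
print("frozen theta*:", theta_star)
print("pairwise agreement with Judge A:", pairwise_accuracy_judge_a(theta_star, DATA))
```

Pairs are formed only among candidates of the same query, which mirrors the within-query framing in the abstract. In a real setup the two label sets would come from different judge models, and the evaluation labels would be kept out of every step that can influence $\theta$, including model selection and early stopping.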