Approximating Human Preferences Using a Multi-Judge Learned System
Eitán Sprejer, Fernando Avalos, Augusto Bernardi, Jose Pedro Brito de Azevedo Faustino, Jacob Haimes, Narmeen Fatimah Oozeer
TL;DR
This work presents a persona-based, multi-judge framework to approximate human preferences by aggregating outputs from multiple rubric-conditioned judges. It introduces two learned aggregators, a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP), trained to predict synthetic ground-truth preferences generated by diverse personas. Experiments on the UltraFeedback dataset show learned aggregators outperform simple baselines by about 15% in $R^2$, with GAM offering interpretable judge contributions. The study also analyzes robustness to human biases and rubric perturbations, highlighting both the potential and limitations of synthetic-ground-truth approaches for scalable reward modeling and model routing in RLHF.
Abstract
Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge would advance key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We compare this approach against naive baselines and assess its robustness through case studies on both human and LLM-judge biases. Our primary contributions are a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
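To make the aggregation idea concrete, the following minimal sketch compares a naive mean-of-judges baseline with a learned aggregator on synthetic data. Everything in it is an illustrative assumption rather than the paper's implementation: the three judges, the persona weights in `TRUE_WEIGHTS`, the noise level, and the `[0, 1]` score scale are invented, and a plain linear model fit by gradient descent stands in for the GAM and MLP aggregators.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical weights one persona places on three rubric-conditioned judges.
TRUE_WEIGHTS = [0.70, 0.25, 0.05]


def make_example():
    """Return (judge_scores, persona_score) for one synthetic response."""
    scores = [rng.random() for _ in TRUE_WEIGHTS]  # judge scores in [0, 1]
    label = sum(w * s for w, s in zip(TRUE_WEIGHTS, scores))
    return scores, label + rng.gauss(0.0, 0.05)  # noisy synthetic ground truth


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_linear_aggregator(X, y, lr=0.5, epochs=500):
    """Fit weights and a bias by full-batch gradient descent on squared error."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for row, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, row)) + b - target
            for j in range(k):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


data = [make_example() for _ in range(300)]
train, test = data[:200], data[200:]
X_train, y_train = [x for x, _ in train], [y for _, y in train]
X_test, y_test = [x for x, _ in test], [y for _, y in test]

# Naive baseline: unweighted mean of the judge scores.
baseline_pred = [sum(x) / len(x) for x in X_test]
baseline_r2 = r2_score(y_test, baseline_pred)

# Learned aggregator: per-judge weights fitted to the synthetic labels.
weights, bias = fit_linear_aggregator(X_train, y_train)
learned_pred = [sum(w * xi for w, xi in zip(weights, x)) + bias for x in X_test]
learned_r2 = r2_score(y_test, learned_pred)

print("baseline R^2:", round(baseline_r2, 3))
print("learned  R^2:", round(learned_r2, 3))
print("learned judge weights:", [round(w, 2) for w in weights])
```

In this toy setting the fitted weights can be read directly as per-judge contributions, which loosely mirrors the interpretability argument made for the GAM. Swapping the linear fit for a GAM or an MLP would change only `fit_linear_aggregator` and the prediction step.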
