Approximating Human Preferences Using a Multi-Judge Learned System

Eitán Sprejer, Fernando Avalos, Augusto Bernardi, Jose Pedro Brito de Azevedo Faustino, Jacob Haimes, Narmeen Fatimah Oozeer

TL;DR

This work presents a persona-based, multi-judge framework to approximate human preferences by aggregating outputs from multiple rubric-conditioned judges. It introduces two learned aggregators, a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP), trained to predict synthetic ground-truth preferences generated by diverse personas. Experiments on the UltraFeedback dataset show learned aggregators outperform simple baselines by about 15% in $R^2$, with GAM offering interpretable judge contributions. The study also analyzes robustness to human biases and rubric perturbations, highlighting both the potential and limitations of synthetic-ground-truth approaches for scalable reward modeling and model routing in RLHF.

Abstract

Aligning LLM-based judges with human preferences is a significant challenge, as such judges are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We compare the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM-judge biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
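To make the idea of a learned aggregator concrete, the following is a minimal sketch, not the authors' implementation. It swaps in a plain least-squares linear aggregator (in the spirit of the learned linear baseline, not the GAM or MLP) and compares it with a naive mean of judge scores using $R^2$. The data are synthetic and every name here (`fit_linear_aggregator`, `predict`, the 0 to 4 score range, the number of judges, the noise level) is an assumption made for illustration only.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_linear_aggregator(J, y):
    """Least-squares fit of y ~ J @ w + b; returns weights with the bias last."""
    X = np.column_stack([J, np.ones(len(J))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict(w, J):
    """Apply the fitted linear aggregator to a matrix of judge scores."""
    return np.column_stack([J, np.ones(len(J))]) @ w

# Synthetic stand-in data (assumption): 10 judges scoring on a 0-4 scale.
rng = np.random.default_rng(0)
n_samples, n_judges = 2000, 10
J = rng.uniform(0.0, 4.0, size=(n_samples, n_judges))

# Hypothetical "true preference": an unequal weighting of judges plus noise,
# standing in for the persona-simulated preference score.
true_w = np.linspace(0.0, 1.0, n_judges)
true_w = true_w / true_w.sum()
y = J @ true_w + rng.normal(0.0, 0.3, size=n_samples)

# Hold out the last 500 samples for evaluation.
train, test = slice(0, 1500), slice(1500, None)
w = fit_linear_aggregator(J[train], y[train])

r2_learned = r2_score(y[test], predict(w, J[test]))
r2_naive = r2_score(y[test], J[test].mean(axis=1))  # naive mean-of-judges baseline

print(f"learned linear aggregator R^2: {r2_learned:.3f}")
print(f"naive mean baseline R^2:       {r2_naive:.3f}")
```

In this toy setup the learned weights can recover the unequal importance of the judges, which the uniform mean cannot. The numbers it prints describe only the synthetic data above and say nothing about the results reported in the paper.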

Paper Structure

This paper contains 29 sections, 3 equations, 9 figures.

Figures (9)

  • Figure 1: Diagram of the system setup. Starting from prompt–answer pairs, we simulate human preference scores (True Preference Score) using a persona-parameterized evaluator (e.g., llama-3.1-405b; Simulated Human Feedback), and collect rubric-based scores from multiple judges (Judge {i}). We then train an aggregator $f(J)$ to predict the simulated preference scores from the judge scores.
  • Figure 2: Model Performance Comparison, a comprehensive evaluation across all aggregation methods. Key results: (1) MLP achieves best overall performance (R² = 0.578), (2) GAM provides comparable performance (R² = 0.575) with full interpretability, (3) Learned linear baselines (R² = 0.544) outperform naive methods, and (4) Single best judge performs significantly worse (R² = 0.353), validating the multi-judge approach.
  • Figure 3: GAM feature importance analysis. Analysis of judge importance across 20 independent model training runs. The GAM produces stable and reproducible feature importance rankings, with Truthfulness, Instruction Following, Clarity, Conciseness and Logical Consistency consistently ranking as top contributors, while Harmlessness and Explanatory Depth contribute minimally. Low variance in importance scores (error bars) indicates reliable interpretability across different training initializations.
  • Figure 4: Aggregator Performance Across Different Ground Truth Types: The top panel shows R² performance comparison across four ground truth types, with Persona Mean achieving the highest performance (GAM R² = 0.695). The bottom panel displays individual persona performance variation, with the Student persona achieving best results (R² = 0.693) and Child persona showing poorest alignment (R² = 0.442). This 25-percentage-point range reveals significant systematic differences in how well judge ensembles can align with different human preference profiles.
  • Figure 5: Aggregator robustness to persona contamination. Systematic bias causes gradual degradation, random noise leaves performance stable until 15% contamination, and scale compression causes the most severe drops. The system maintains reasonable performance up to 20% contamination.
  • ...and 4 more figures