Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Justin Zhao, Flor Miriam Plaza-del-Arco, Benjamin Genchel, Amanda Cercas Curry

TL;DR

The paper introduces the Language Model Council (LMC), a fully inclusive, democracy-style framework for evaluating foundation models on highly subjective tasks by collaboratively formulating tests, generating responses, and judging outcomes to produce a consensus ranking. It demonstrates the approach with a case study on emotional intelligence (EI) using 20 LLMs, showing that LMC rankings are more separable and align more closely with human judgments than individual judges or traditional benchmarks. Through Monte Carlo simulations and sub-council analyses, the work provides nuanced guidance on council size, test-set scale, and aggregation, highlighting robustness to adversarial judges and diminishing returns for very large councils. The findings have practical implications for trustworthy, human-aligned evaluation of LLMs and motivate careful design choices around test sets, judging protocols, and council composition. Limitations include generalizability to other tasks, single-turn English-only evaluation, and reproducibility challenges with evolving LLMs and closed models.

Abstract

As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks - such as those related to emotional intelligence, creative writing, and persuasiveness - may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.
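The Monte Carlo study of hypothetical council compositions can be illustrated with a minimal sketch. It is not the authors' implementation: the vote format `(judge, winner, loser)`, the function names `council_ranking` and `simulate_subcouncils`, and the use of a plain win rate as the aggregation rule are all assumptions made for illustration. The sketch samples random sub-councils and reports how often their ranking matches the full council's.

```python
import random
from collections import defaultdict


def council_ranking(votes, judges):
    """Rank respondents by win rate, counting only votes cast by `judges`.

    `votes` is a list of (judge, winner, loser) tuples. This format and the
    win-rate aggregation are hypothetical stand-ins, not the paper's method.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for judge, winner, loser in votes:
        if judge in judges:
            wins[winner] += 1
            games[winner] += 1
            games[loser] += 1
    # Highest win rate first; ties are broken alphabetically for determinism.
    return sorted(games, key=lambda m: (-wins[m] / games[m], m))


def simulate_subcouncils(votes, council_size, trials=100, seed=0):
    """Fraction of random sub-councils whose ranking equals the full council's."""
    rng = random.Random(seed)
    all_judges = sorted({judge for judge, _, _ in votes})
    full = council_ranking(votes, set(all_judges))
    matches = 0
    for _ in range(trials):
        sub = set(rng.sample(all_judges, council_size))
        if council_ranking(votes, sub) == full:
            matches += 1
    return matches / trials


# Toy data: three judges who all agree that A beats B, B beats C, A beats C.
toy_votes = [
    (judge, winner, loser)
    for judge in ("j1", "j2", "j3")
    for winner, loser in (("A", "B"), ("B", "C"), ("A", "C"))
]
full_ranking = council_ranking(toy_votes, {"j1", "j2", "j3"})
agreement = simulate_subcouncils(toy_votes, council_size=2, trials=20)
```

With unanimous toy judges every sub-council reproduces the full ranking; with real, disagreeing judges the agreement rate would depend on council size, which is the kind of trade-off the simulations are meant to expose.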

Paper Structure (67 sections, 6 equations, 33 figures, 15 tables)

Figures (33)

  • Figure 1: Overview of the Language Model Council (LMC) evaluation framework. By using the same LLMs for test set formulation, task completion, and judging, the framework offers an equitable way to achieve an inclusive, consensus-based ranking.
  • Figure 2: Spearman correlation between EI score and key judging qualities across 20 LLM council members.
  • Figure 3: LLM rankings from different benchmarks.
  • Figure 4: Kendall-Tau correlation between benchmark scores and human study scores for nine LLMs (see the human study appendix).
  • Figure 5: Rank stability (MERV; panels (a) and (b)) and separability (panels (c) and (d)), averaged over 100 randomized trials for various numbers of judges and examples. Panels (a) and (c) display raw metric values, while (b) and (d) display the gradient magnitude (colors) and direction (arrows). The gradient calculation follows a Manhattan distance approach in which row-wise and column-wise gradients are linearly combined to reflect the discreteness of changes between adjacent squares, highlighting the incremental impact of adding another judge or more examples.
  • ...and 28 more figures
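
The Manhattan-style gradient described in the Figure 5 caption might be computed roughly as follows. This is a sketch under stated assumptions, not the paper's code: the function name `manhattan_gradient`, the use of forward differences, the zero gradient at the grid boundary, and the equal-weight sum of absolute row-wise and column-wise differences are all illustrative choices. Which grid axis corresponds to judges and which to examples is also left open here.

```python
def manhattan_gradient(grid):
    """Per-cell gradient of a metric grid using forward differences.

    `grid[r][c]` holds a metric value (for example separability) for one
    combination of number of judges and number of examples. For every cell
    the function returns (magnitude, d_row, d_col), where d_row and d_col
    are the changes when moving one step along each axis and
    magnitude = |d_row| + |d_col| (an assumed equal-weight combination).
    Cells on the last row or column get a zero difference along that axis.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    result = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            d_row = grid[r + 1][c] - grid[r][c] if r + 1 < n_rows else 0.0
            d_col = grid[r][c + 1] - grid[r][c] if c + 1 < n_cols else 0.0
            row.append((abs(d_row) + abs(d_col), d_row, d_col))
        result.append(row)
    return result


# Toy 2x2 grid of metric values (made-up numbers, for illustration only).
toy_grid = [[1.0, 2.0], [4.0, 8.0]]
toy_gradient = manhattan_gradient(toy_grid)
```

The magnitude would drive the cell color and the pair `(d_row, d_col)` the arrow direction in a plot like panels (b) and (d).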