Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications

Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas

TL;DR

This paper tackles the challenge of evaluating open-ended natural language generation by introducing a dynamic multi-agent LLM judge that automatically designs personalized evaluation prompts. The three-agent loop (Sample Selection, Evaluation, and ReWrite) iteratively refines prompts to balance predefined semantic similarity with downstream task adaptation, using GPT-3.5 for judgment and GPT-4 for rewriting. Empirical results on Instruct-QA demonstrate substantial accuracy gains (e.g., AUC improving to $0.91$) and improved alignment with human judgments on the STSB benchmark (up to $r=0.81$). The work offers a scalable, cost-effective alternative to human evaluation and provides insights into effective prompt design for LLM-based evaluators, with future work addressing broader aspects such as faithfulness and bias mitigation.

Abstract

Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for robust evaluation methodologies to accurately assess LLM-based applications. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. Recent research has explored leveraging LLMs to mimic human reasoning and decision-making processes for evaluation purposes, an approach known as the LLM-as-a-judge framework. However, existing frameworks of this kind have two significant limitations. First, they lack the flexibility to adapt to different text styles, including various answer and ground truth styles, which reduces their generalization performance. Second, the evaluation scores they produce are often skewed and hard to interpret, showing a low correlation with human judgment. To address these challenges, we propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications. This system iteratively refines evaluation prompts and balances the trade-off between the adaptive requirements of downstream tasks and the alignment with human perception. Our experimental results show that the proposed multi-agent LLM Judge framework not only enhances evaluation accuracy compared to existing methods but also produces evaluation scores that better align with human perception.

Paper Structure

This paper contains 15 sections, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of answer correctness between a human judge and an advanced LLM judge for a given query, a ground-truth answer from TopiOCQA (Adlakha et al., 2022), and an LLM-generated answer. While human judges can easily identify the generated answer as incorrect, the state-of-the-art LLM judge fails to recognize this simple error.
  • Figure 2: The proposed multi-agent LLM judge framework operates through the following workflow: Initially, the Prompt block contains the Initial Prompt, which can be updated in later phases. The Sample Selection agent's role is to select a diverse and representative set of examples for the Evaluation agent. The Evaluation agent tests these examples against the input prompt, providing an overall evaluation score as well as detailed feedback for improving the input prompt. The ReWrite agent then reviews both the input prompt and the feedback from the Evaluation agent to produce revised prompts that better guide the LLM judge. The iteration loop continues until the evaluation score meets the user's requirements or the maximum number of iterations is reached.
  • Figure 3: The experimental results on Instruct-QA datasets of different LLM judges: the X-axis denotes the False Positive Rate (FPR), and the Y-axis indicates the True Positive Rate (TPR). Each method's ROC curve is depicted in a distinct color, with the corresponding Area Under the Curve (AUC) values displayed in the bottom right corner of the figure.
  • Figure 4: Evaluation of the alignment between LLM judges and human perception, measured as the Pearson correlation between the scores generated by each LLM judge and human annotations. The X-axis denotes the different LLM judges and the Y-axis denotes the correlation score with human annotations.
  • Figure 5: A comparison between the Initial Prompt (displayed at the top) and the automatically optimized final prompt generated by the proposed multi-agent LLM judge (shown at the bottom).
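
The workflow described in the Figure 2 caption can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names (`select_samples`, `evaluate`, `optimize_prompt`), the accuracy-based score, the evenly spaced sampling, and the toy `toy_judge` and `toy_rewrite` stand-ins are all hypothetical. In the paper the judging and rewriting roles are played by LLMs (GPT-3.5 and GPT-4); here they are replaced by deterministic stubs so the loop can run offline.

```python
from typing import Callable, List, Tuple

# Hypothetical example record: (query, ground_truth, answer, human_label)
Example = Tuple[str, str, str, int]
Judge = Callable[[str, Example], int]          # (prompt, example) -> 0 or 1
Rewriter = Callable[[str, List[str]], str]     # (prompt, feedback) -> new prompt


def select_samples(pool: List[Example], k: int) -> List[Example]:
    """Sample Selection stand-in: take up to k evenly spaced examples."""
    step = max(1, len(pool) // k)
    return pool[::step][:k]


def evaluate(prompt: str, samples: List[Example], judge: Judge) -> Tuple[float, List[str]]:
    """Evaluation stand-in: accuracy against labels plus feedback on misjudged examples.

    Assumes `samples` is non-empty.
    """
    errors = [ex for ex in samples if judge(prompt, ex) != ex[3]]
    score = 1.0 - len(errors) / len(samples)
    feedback = [f"misjudged answer {ex[2]!r} against ground truth {ex[1]!r}" for ex in errors]
    return score, feedback


def optimize_prompt(initial_prompt: str, pool: List[Example], judge: Judge,
                    rewrite: Rewriter, target: float = 0.9,
                    max_iters: int = 5, k: int = 4) -> Tuple[str, float]:
    """Iterate select -> evaluate -> rewrite until the target score or max_iters."""
    prompt = initial_prompt
    best_prompt, best_score = prompt, -1.0
    for _ in range(max_iters):
        samples = select_samples(pool, k)
        score, feedback = evaluate(prompt, samples, judge)
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= target:
            break  # the score meets the user's requirement
        prompt = rewrite(prompt, feedback)  # ReWrite stand-in revises the prompt
    return best_prompt, best_score


# Toy stand-ins for the LLM calls (assumptions for illustration only).
def toy_judge(prompt: str, ex: Example) -> int:
    _, truth, answer, _ = ex
    if "ignore case" in prompt.lower():
        return int(answer.lower() == truth.lower())
    return int(answer == truth)


def toy_rewrite(prompt: str, feedback: List[str]) -> str:
    return prompt + " Ignore case." if feedback else prompt


POOL: List[Example] = [
    ("q1", "Paris", "paris", 1),
    ("q2", "4", "5", 0),
    ("q3", "blue", "Blue", 1),
    ("q4", "Mars", "Mars", 1),
]

if __name__ == "__main__":
    print(optimize_prompt("Score 1 if the answer matches the ground truth.",
                          POOL, toy_judge, toy_rewrite))
```

The sketch returns the best prompt that was actually evaluated rather than the most recently rewritten one. That is a design choice made for this illustration, not something the caption specifies.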