LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian

TL;DR

The paper introduces LLM-as-a-qualitative-judge, a method that automates qualitative error analysis for natural language generation by generating structured reports of common error types. It combines open-ended per-instance analysis with a cumulative clustering algorithm to produce interpretable, human-like error-type reports across diverse NLG tasks. Through real-world and synthetic meta-evaluation on 12 datasets (~300 instances), the authors show that LLM-generated per-instance explanations align with human judgments in about two thirds of cases and that the aggregated reports can guide meaningful system improvements, as demonstrated in a BigBenchHard case study. The work provides a practical, reproducible framework for diagnosing NLG errors and offers tools and data to support broader meta-evaluation of LLM-based evaluators.

Abstract

Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e., with numerical scores as the main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach aims to give developers meaningful insights into which improvements can be made to a given NLG system, and consists of two main steps: open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 of cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG system performance. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
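
The abstract does not spell out the cumulative clustering step, so the following is only a minimal sketch of one plausible reading: issues are processed one at a time, and each is either assigned to an existing issue type or opens a new one. The names `cumulative_clustering`, `Assigner`, and `keyword_assigner` are hypothetical, and the keyword-overlap rule is a toy stand-in for the decision an LLM judge would make; none of this is taken from the authors' code.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the LLM decision: given an issue description and
# the current issue-type names, return the index of the matching type, or
# None to open a new type. A real judge would prompt an LLM here.
Assigner = Callable[[str, list[str]], Optional[int]]


def cumulative_clustering(issues: list[str], assign: Assigner) -> dict[str, list[int]]:
    """Process issues one by one, growing the set of issue types as needed."""
    names: list[str] = []          # one name per issue type discovered so far
    members: list[list[int]] = []  # instance indices belonging to each type
    for idx, issue in enumerate(issues):
        choice = assign(issue, names)
        if choice is None:
            names.append(issue)    # assumption: the first issue names the type
            members.append([idx])
        else:
            members[choice].append(idx)
    return dict(zip(names, members))


def keyword_assigner(issue: str, names: list[str]) -> Optional[int]:
    """Toy rule: join the first type sharing at least two words with the issue."""
    words = set(issue.lower().split())
    for i, name in enumerate(names):
        if len(words & set(name.lower().split())) >= 2:
            return i
    return None


issues = [
    "answer contradicts the retrieved context",
    "output is in the wrong language",
    "answer ignores the retrieved context",
    "wrong language used in the output",
]
report = cumulative_clustering(issues, keyword_assigner)
for name, idxs in report.items():
    print(f"{len(idxs)}  {name}")
```

The printed lines have the shape of an issue type report (a count next to each issue type). Swapping `keyword_assigner` for a function that queries an LLM would leave the loop itself unchanged.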

Paper Structure

This paper contains 57 sections, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Issue type reports for two datasets composed by the proposed LLM-as-a-qualitative-judge (GPT-4o) and by a human annotator. All steps of the analysis are performed by GPT-4o, including error type formulation and error grouping. The full generated report also includes comprehensive error type descriptions, omitted here due to the space limit. Cnt represents issue type counts.
  • Figure 2: Illustration of the proposed LLM-as-a-qualitative-judge approach.
  • Figure 3: Case study on three BigBenchHard tasks: after building a simple pipeline for a task, we perform two rounds of generating issue reports with LLM-as-a-qualitative-judge (a table with issue types and their counts) and manually revising the pipeline based solely on the generated reports. Task performance is improved in all cases.
  • Figure 4: Examples of per-instance analysis.
  • Figure 5: Examples of confusion matrices visualizing clustering agreement between the issue type reports generated by LLM-as-a-qualitative-judge and those of the annotator. We find the optimal mapping between clusters found by a human annotator and by LLM-as-a-qualitative-judge, and then define a confusion matrix where each cell $(i, j)$ denotes the number of dataset instances assigned to the annotator's $i$-th cluster and to the $j$-th cluster of LLM-as-a-qualitative-judge.
  • ...and 5 more figures
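
The confusion matrix described in the Figure 5 caption can be illustrated with a small sketch. The caption does not say which criterion makes a mapping "optimal"; the code below assumes it means the column permutation that maximizes the diagonal sum, and it brute-forces that permutation, which is only practical for a handful of clusters (an assignment solver such as the Hungarian algorithm would be the usual choice beyond that). The function names and the toy labels are hypothetical.

```python
from itertools import permutations


def confusion_matrix(human: list[int], judge: list[int]) -> list[list[int]]:
    """Cell (i, j) counts instances in human cluster i and judge cluster j."""
    n_rows = max(human) + 1
    n_cols = max(judge) + 1
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for h, j in zip(human, judge):
        matrix[h][j] += 1
    return matrix


def best_mapping(matrix: list[list[int]]) -> tuple[list[int], int]:
    """Return (perm, score): perm[i] is the judge cluster matched to human
    cluster i, chosen to maximize the diagonal sum. Assumes a square matrix."""
    n = len(matrix)
    best_perm, best_score = list(range(n)), -1
    for perm in permutations(range(n)):
        score = sum(matrix[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = list(perm), score
    return best_perm, best_score


# Toy cluster labels for six instances (illustrative, not from the paper).
human = [0, 0, 0, 1, 1, 2]
judge = [2, 2, 0, 0, 0, 1]

matrix = confusion_matrix(human, judge)
perm, agreed = best_mapping(matrix)
# Reorder columns so matched clusters sit on the diagonal.
aligned = [[row[p] for p in perm] for row in matrix]
for row in aligned:
    print(row)
print(f"{agreed} of {len(human)} instances fall on the diagonal")
```

After reordering, off-diagonal cells show where the two clusterings disagree, which is the kind of reading the caption describes.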