LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian
TL;DR
The paper introduces LLM-as-a-qualitative-judge, a method that automates qualitative error analysis for natural language generation by generating structured reports of common error types. It combines open-ended per-instance analysis with a cumulative clustering algorithm to produce interpretable, human-like error-type reports across diverse NLG tasks. In real-world and synthetic meta-evaluation on 12 datasets (~300 instances), the LLM's per-instance explanations align with human judgments in about two thirds of cases, and the aggregated reports can guide meaningful system improvements, as demonstrated in a BigBenchHard case study. The work provides a practical, reproducible framework for diagnosing NLG errors and offers tools and data to support broader meta-evaluation of LLM-based evaluators.
Abstract
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e., with numerical scores as the main output. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the outputs of an NLG system. Our approach aims to give developers meaningful insights into what improvements can be made to a given NLG system, and it consists of two main steps: open-ended per-instance issue analysis, and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that the instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 of cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling those composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve the performance of an NLG system. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
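The second step, cumulative clustering, can be illustrated with a minimal sketch. It is not the authors' implementation: it only assumes that per-instance issue descriptions are processed one at a time, and that each one is either attached to an existing issue type or used to open a new one. The names `cumulative_cluster`, `keyword_match`, and `build_report` are hypothetical, and `keyword_match` is a toy stand-in for the decision that an LLM would make in practice.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def cumulative_cluster(
    issues: Iterable[str],
    match: Callable[[str, List[str]], Optional[str]],
) -> Dict[str, List[str]]:
    """Group per-instance issue descriptions into issue types, one at a time."""
    clusters: Dict[str, List[str]] = {}
    for issue in issues:
        # Ask the matcher whether the issue fits an already discovered type.
        label = match(issue, list(clusters))
        if label is None or label not in clusters:
            # No suitable type yet: open a new one, named after the issue itself
            # (assumption: a real system would ask the LLM for a type name).
            label = issue
            clusters[label] = []
        clusters[label].append(issue)
    return clusters


def keyword_match(issue: str, labels: List[str]) -> Optional[str]:
    """Toy matcher (assumption): reuse a label that shares its first word with the issue."""
    first = issue.split()[0].lower()
    for label in labels:
        if label.split()[0].lower() == first:
            return label
    return None


def build_report(clusters: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (issue type, count) pairs, most frequent first."""
    return sorted(
        ((label, len(members)) for label, members in clusters.items()),
        key=lambda pair: -pair[1],
    )


issues = [
    "hallucinated entity in summary",
    "truncated answer",
    "hallucinated date",
    "wrong language",
]
clusters = cumulative_cluster(issues, keyword_match)
report = build_report(clusters)
```

Swapping `keyword_match` for a function that prompts an LLM with the new issue and the current list of issue types would leave the loop itself unchanged; the resulting report is simply the list of issue types ordered by frequency.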
