Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses
Fangyi Yu, Nabeel Seedat, Dasha Herrmannova, Frank Schilder, Jonathan Richard Schwarz
TL;DR
The paper introduces DeCE, a decomposed criteria-based evaluation framework for LLMs that separates precision (factual grounding) from recall (coverage of required information) using instance-specific criteria derived from gold answers. It formalizes evaluation on a (question, gold answer, model answer) triple and implements two workflows that compute the decomposed scores through explicit element grounding and criterion satisfaction. In a real-world legal QA setting, DeCE aligns with expert judgments more closely than lexical metrics and pointwise or multidimensional baselines (up to $r=0.78$ for F2), while revealing interpretable model behavior such as precision-recall trade-offs across jurisdictions and query types. The framework is scalable, with only $11.95\%$ of generated criteria requiring expert revision, and offers actionable diagnostics for domain-specific model improvement, positioning DeCE as a practical and interpretable evaluation tool for expert AI systems.
Abstract
Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) from recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework for expert domains.
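As a rough illustration of how decomposed scores might be combined, the sketch below computes precision, recall, and an F-beta score (F2 when beta is 2) from per-element and per-criterion judgments. The abstract does not give DeCE's exact scoring formulas, so this assumes precision is the fraction of model-answer elements judged to be supported by the gold answer and recall is the fraction of gold-answer criteria judged satisfied. The `Judgment` class and its field names are hypothetical, and in practice the boolean judgments would presumably come from an LLM judge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    # Hypothetical container: one boolean per model-answer element
    # (is it supported by the gold answer?) and one per gold-answer
    # criterion (does the model answer satisfy it?).
    element_supported: List[bool]
    criterion_satisfied: List[bool]


def precision(j: Judgment) -> float:
    """Assumed definition: share of model-answer elements that are supported."""
    if not j.element_supported:
        return 0.0
    return sum(j.element_supported) / len(j.element_supported)


def recall(j: Judgment) -> float:
    """Assumed definition: share of gold-answer criteria that are satisfied."""
    if not j.criterion_satisfied:
        return 0.0
    return sum(j.criterion_satisfied) / len(j.criterion_satisfied)


def f_beta(p: float, r: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Toy example: 3 of 4 answer elements supported, 1 of 2 criteria satisfied.
example = Judgment(
    element_supported=[True, True, False, True],
    criterion_satisfied=[True, False],
)
p, r = precision(example), recall(example)
print(p, r, f_beta(p, r, beta=2.0))
```

Keeping precision and recall as separate values before combining them is what allows trade-offs like those noted above (generalist models favoring recall, specialized models favoring precision) to be read directly from the scores.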
