Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses

Fangyi Yu, Nabeel Seedat, Dasha Herrmannova, Frank Schilder, Jonathan Richard Schwarz

TL;DR

The paper introduces DeCE, a decomposed criteria-based evaluation framework for LLMs that separates precision (factual grounding) from recall (coverage of Required Information) using instance-specific criteria derived from gold answers. It formalizes evaluation over a (question, gold answer, model answer) triple and implements two workflows that compute decomposed scores with explicit element grounding and criterion satisfaction. In a real-world legal QA setting, DeCE aligns with expert judgments more closely than lexical metrics and pointwise or multidimensional baselines (up to $r=0.78$ for F2), while revealing interpretable model behavior such as precision-recall trade-offs across jurisdictions and query types. Only $11.95\%$ of generated criteria required expert revision, which supports the framework's scalability. The decomposed scores also offer actionable diagnostics for domain-specific model improvement, positioning DeCE as a practical and interpretable evaluation tool for expert AI systems.

Abstract

Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments ($r=0.78$), compared to traditional metrics ($r=0.12$), pointwise LLM scoring ($r=0.35$), and modern multidimensional evaluators ($r=0.48$). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
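To make the decomposition concrete, the following is a minimal sketch of how per-element and per-criterion verdicts could be aggregated into precision, recall, and an F2 score. It is not the authors' implementation: the `Judgment` class, the helper names, and the example verdicts are hypothetical, and the exact aggregation used by DeCE is assumed here to be a simple fraction of passed checks combined with the standard F-beta formula (beta = 2 weights recall more heavily than precision).

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One binary verdict from a judge (hypothetical structure, not from the paper)."""
    text: str     # the answer element or gold-answer criterion being judged
    passed: bool  # True if the element is supported / the criterion is satisfied


def fraction_passed(judgments):
    """Share of judgments that passed; defined as 0.0 for an empty list."""
    if not judgments:
        return 0.0
    return sum(j.passed for j in judgments) / len(judgments)


def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; returns 0.0 when both inputs are zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up verdicts for illustration only.
# Precision side: factual elements extracted from the model answer.
elements = [
    Judgment("Statute of limitations is two years", True),
    Judgment("Claim must be filed in state court", True),
    Judgment("Cites the controlling statute", True),
    Judgment("Damages are capped", False),  # unsupported by the gold answer
]
# Recall side: criteria extracted from the gold answer.
criteria = [
    Judgment("States the limitation period", True),
    Judgment("Names the governing statute", True),
    Judgment("Identifies the proper forum", True),
    Judgment("Mentions the tolling exception", False),  # missing in the answer
    Judgment("Notes the discovery rule", True),
]

precision = fraction_passed(elements)  # 3 of 4 elements supported
recall = fraction_passed(criteria)     # 4 of 5 criteria satisfied
f2 = f_beta(precision, recall, beta=2.0)
print(f"precision={precision:.2f} recall={recall:.2f} F2={f2:.3f}")
```

Keeping the two lists separate is what preserves the diagnostic signal: a single averaged score would hide whether an answer failed by asserting unsupported content or by omitting required points.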

Paper Structure

This paper contains 45 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the DeCE evaluation pipeline. (a) The precision workflow decomposes the model-generated answer into factual elements, which are then individually verified for factual correctness and relevance against the gold answer. (b) The recall workflow extracts evaluation criteria from the gold answer Required Information and checks whether each criterion is satisfied in the model response. Together, these workflows yield decomposed scores that provide interpretable evaluation signals for expert-domain model evaluation.
  • Figure 2: Distribution of pointwise scores (0–4) for each model, judged by Claude 3.5 using rubric-based Likert evaluation. Gemini-2.5-Pro achieves the highest proportion of top-rated responses (71.3%), while legally fine-tuned Llama-3.1-70B shows lower scores, suggesting model scale may outweigh domain specialization for complex legal reasoning.
  • Figure 3: DeCE scores (precision and recall) for each evaluated model. Larger generalist models (e.g., Gemini, GPT-4o) demonstrate stronger recall, while legally fine-tuned models exhibit higher precision, highlighting complementary strengths.
  • Figure 4: Model performance across jurisdictions (precision vs. recall). Ohio State achieves high performance (recall: 0.98, precision: 0.55), while Texas State and Florida State & Federal show strong, balanced performance. New York State & Federal exhibits low precision (0.38) despite moderate recall, and Minnesota State falls into the failure quadrant, with both low precision and low recall.
  • Figure 5: Model performance across query types (precision vs. recall). Basic concept inquiries achieve optimal performance (recall: 0.87, precision: 0.55), while source-specific requests show the poorest results (recall: 0.57, precision: 0.37). Complex legal reasoning problems consistently challenge all models, highlighting fundamental gaps in legal reasoning capabilities.