SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
Homaira Huda Shomee, Rochana Chaturvedi, Yangxinyu Xie, Tanwi Mallick
TL;DR
SCORE introduces a reference-free, multi-dimensional framework for evaluating LLM outputs in domain-specific hazard analysis and decision support. It builds a synthetic, context-rich dataset of 1,412 question–answer pairs across 40 professions and seven hazard types, grounded in user profiles and retrieved literature. The framework jointly measures specificity, robustness, answer relevance, and context utilization (plus readability), using multi-agent judgments, paraphrase and perturbation tests, masking, reranking, and leave-one-out analyses. Human and automated evaluations reveal that no single metric suffices and highlight subjectivity in expert-oriented tasks, emphasizing the value of a structured, multi-metric approach for safe, effective real-world deployment.
Abstract
Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
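To make the context utilization dimension concrete, here is a minimal sketch of a generic leave-one-out analysis of the kind named in the TL;DR. It is an illustration under stated assumptions, not the authors' implementation: the scorer `toy_score` is a hypothetical token-overlap stand-in for whatever judge or metric the framework actually uses, and the answer is held fixed rather than regenerated after each passage is removed.

```python
from typing import Callable, List


def toy_score(answer: str, passages: List[str]) -> float:
    """Hypothetical stand-in scorer (an assumption, not from the paper):
    fraction of distinct answer tokens that appear in the given passages."""
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    context_tokens = set(" ".join(passages).lower().split())
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def leave_one_out(
    answer: str,
    passages: List[str],
    score: Callable[[str, List[str]], float] = toy_score,
) -> List[float]:
    """Return, for each passage, the drop in score when that passage is removed.
    A large drop suggests the answer relies on that passage."""
    full = score(answer, passages)
    drops = []
    for i in range(len(passages)):
        reduced = passages[:i] + passages[i + 1:]  # context without passage i
        drops.append(full - score(answer, reduced))
    return drops


# Illustrative inputs (invented for this sketch, not drawn from the dataset).
passages = [
    "flood depth exceeded two meters",
    "the bridge was built in 1990",
]
answer = "flood depth exceeded two meters"
print(leave_one_out(answer, passages))  # [1.0, 0.0]
```

In this toy case the first passage accounts for the entire score and the second contributes nothing. In practice the scorer would be replaced by a stronger measure, such as an LLM judge or an embedding-based similarity.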
