SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

Homaira Huda Shomee; Rochana Chaturvedi; Yangxinyu Xie; Tanwi Mallick

TL;DR

SCORE introduces a reference-free, multi-dimensional framework for evaluating LLM outputs in domain-specific hazard analysis and decision support. It builds a synthetic, context-rich dataset of 1,412 question–answer pairs across 40 professions and seven hazard types, grounded in user profiles and retrieved literature. The framework jointly measures specificity, robustness, answer relevance, and context utilization (plus readability), using multi-agent judgments, paraphrase and perturbation tests, masking, reranking, and leave-one-out analyses. Human and automated evaluations reveal that no single metric suffices and highlight subjectivity in expert-oriented tasks, emphasizing the value of a structured, multi-metric approach for safe, effective real-world deployment.

Abstract

Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.

Paper Structure (42 sections, 12 equations, 6 figures, 12 tables)

Introduction
Dataset Construction
User Profile.
Professional Context.
Location and Hazard Type.
Question Generation.
Answer Generation.
Evaluation Framework
Specificity
Robustness
Paraphrasing:
Perturbation:
Answer Relevance
Answer Relevance with Masking.
Context-Utilization
...and 27 more sections

Figures (6)

Figure 1: Specificity score computation: each generated answer is decomposed into atomic claims, specific details (hazard type, location, timeline, intensity) are extracted, and each claim is evaluated using multiple LLM-based agents before aggregating their judgments.
Figure 2: Robustness evaluation workflow. For each question–answer pair, the system generates paraphrased and hazard/location-perturbed variants of the question, runs them through the RAG pipeline, and compares the generated answers to assess semantic consistency and sensitivity to controlled perturbations.
Figure 3: Answer relevance pipeline. For each answer, the system generates a reranked answer and an answer relevance score. Outputs are shown inside blue boxes.
Figure 4: The annotation dashboard where each annotator receives ten questions and evaluates them by selecting the “Annotate” button for each item.
Figure 5: Human Annotation Interface of our evaluation pipeline.
...and 1 more figures
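
The aggregation step described in the Figure 1 caption (multiple LLM-based agents judge each atomic claim, and their judgments are then combined) can be illustrated with a minimal sketch. The caption does not say how judgments are aggregated, so the majority vote and the "fraction of specific claims" score below are assumptions made for illustration, not the paper's method. The names `majority_vote`, `specificity_score`, and `example_votes` are hypothetical, and the votes are made-up placeholders rather than results from the paper.

```python
from typing import Dict, List


def majority_vote(votes: List[bool]) -> bool:
    """Return True only when strictly more than half of the agents voted True.

    Assumption: ties count as "not specific". The source does not define a tie rule.
    """
    return sum(votes) * 2 > len(votes)


def specificity_score(claim_votes: Dict[str, List[bool]]) -> float:
    """Fraction of atomic claims that a majority of agents judged specific.

    Assumption: this ratio stands in for the paper's aggregation, which the
    figure caption does not spell out.
    """
    if not claim_votes:
        return 0.0
    specific = sum(majority_vote(votes) for votes in claim_votes.values())
    return specific / len(claim_votes)


# Hypothetical per-claim votes from three agents (True = "claim is specific").
example_votes = {
    "claim A": [True, True, False],
    "claim B": [False, False, True],
    "claim C": [True, True, True],
    "claim D": [False, True, False],
}

score = specificity_score(example_votes)
print(f"specificity score: {score:.2f}")
```

In this sketch an answer with no extracted claims scores 0.0, and an even split among agents is treated as not specific; both are design choices of the illustration and could reasonably be made differently.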
