HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
Langqi Yang, Tianhang Zheng, Yixuan Chen, Kedong Xiu, Hao Zhou, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
TL;DR
This paper presents HarmMetric Eval, a comprehensive benchmark for evaluating harmfulness metrics and judges applied to LLM outputs. It defines three fine-grained criteria (unsafe, relevant, and useful) and builds a diverse dataset of 3.5k+ prompt–response pairs across 20+ harm categories, with a unified scoring mechanism that maps arbitrary outputs to a common effectiveness score. Across nearly 20 metrics and judges, conventional reference-based metrics such as ROUGE and METEOR outperform many LLM-based judges, challenging the assumption that semantic understanding alone yields the best harmfulness evaluation. The authors then propose HarmJudge and HarmClassifier as improved judges and show that fine-tuning with high-quality reference metrics yields state-of-the-art performance on HarmMetric Eval. The findings highlight important gaps in current harm assessment tools and offer practical guidance for designing robust, scalable safety evaluation for LLMs, with implications for benchmarking, metric selection, and model safety research.
Abstract
The potential for large language models (LLMs) to generate harmful content poses a significant safety risk in their deployment. To address and assess this risk, the community has developed numerous harmfulness evaluation metrics and judges. However, the lack of a systematic benchmark for evaluating these metrics and judges undermines the credibility and consistency of LLM safety assessments. To bridge this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. In HarmMetric Eval, we build a high-quality dataset of representative harmful prompts paired with highly diverse harmful model responses and non-harmful counterparts across multiple categories. We also propose a flexible scoring mechanism that rewards metrics for correctly ranking harmful responses above non-harmful ones and that is applicable to almost all existing metrics and judges, regardless of their output formats and scoring scales. Using HarmMetric Eval, we uncover a surprising finding through extensive experiments: conventional reference-based metrics such as ROUGE and METEOR can outperform existing LLM-based judges in fine-grained harmfulness evaluation, challenging prevailing assumptions about LLMs' superiority in this domain. To explain this finding, we provide a fine-grained analysis of the limitations of LLM-based judges in rating irrelevant or useless responses. Furthermore, we build a new harmfulness judge by incorporating the fine-grained criteria into its prompt template and leveraging reference-based metrics to fine-tune its base LLM. The resulting judge outperforms all existing metrics and judges in evaluating harmful responses.
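The ranking-based idea behind the scoring mechanism can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below is an assumption rather than the benchmark's actual definition: it computes the fraction of (harmful, non-harmful) response pairs that a metric orders correctly, counting ties as half. The function name `pairwise_ranking_score` and the tie rule are hypothetical choices made for illustration.

```python
from typing import Sequence


def pairwise_ranking_score(
    harmful_scores: Sequence[float],
    non_harmful_scores: Sequence[float],
) -> float:
    """Fraction of (harmful, non-harmful) pairs ranked correctly.

    Hypothetical illustration, not the paper's exact scoring rule.
    A pair counts as correct when the harmful response receives a
    strictly higher score; a tie counts as half (an assumption).
    """
    if not harmful_scores or not non_harmful_scores:
        raise ValueError("both score lists must be non-empty")

    correct = 0.0
    for h in harmful_scores:
        for n in non_harmful_scores:
            if h > n:
                correct += 1.0
            elif h == n:
                correct += 0.5
    return correct / (len(harmful_scores) * len(non_harmful_scores))


# A metric that outputs scores in [0, 1] for one prompt's responses.
score_unit = pairwise_ranking_score([0.9, 0.4], [0.1, 0.4, 0.7])

# A judge that outputs 1-5 ratings and induces the same ordering.
score_likert = pairwise_ranking_score([5, 3], [1, 3, 4])
```

Because only the relative order of scores matters, the two calls above return the same value even though one metric uses a [0, 1] scale and the other a 1-5 scale. This is one way a single score can be made comparable across metrics and judges with different output formats.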
