HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment

Langqi Yang, Tianhang Zheng, Yixuan Chen, Kedong Xiu, Hao Zhou, Di Wang, Puning Zhao, Zhan Qin, Kui Ren

TL;DR

This paper presents HarmMetric Eval, a comprehensive benchmark for evaluating harmfulness metrics and judges applied to LLM outputs. It defines three fine-grained criteria (unsafe, relevant, and useful) and builds a diverse dataset of 3.5k+ prompt–response pairs across 20+ harm categories, with a unified scoring mechanism that maps outputs in arbitrary formats to a common effectiveness score. Across nearly 20 metrics and judges, conventional reference-based metrics such as ROUGE and METEOR outperform many LLM-based judges, challenging the assumption that semantic understanding alone yields the best harm evaluation. The authors then propose HarmJudge and HarmClassifier as improved judges and show that fine-tuning with high-quality reference-based metrics yields state-of-the-art performance on HarmMetric Eval. The findings highlight important gaps in current harm assessment tools and offer practical guidance for designing robust, scalable safety evaluation for LLMs, with implications for benchmarking, metric selection, and model safety research.

Abstract

The potential for large language models (LLMs) to generate harmful content poses a significant safety risk in their deployment. To address and assess this risk, the community has developed numerous harmfulness evaluation metrics and judges. However, the lack of a systematic benchmark for evaluating these metrics and judges undermines the credibility and consistency of LLM safety assessments. To bridge this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. In HarmMetric Eval, we build a high-quality dataset of representative harmful prompts paired with highly diverse harmful model responses and non-harmful counterparts across multiple categories. We also propose a flexible scoring mechanism that rewards metrics for correctly ranking harmful responses above non-harmful ones and is applicable to almost all existing metrics and judges, regardless of their output formats and scoring scales. Using HarmMetric Eval, we uncover a surprising finding through extensive experiments: conventional reference-based metrics such as ROUGE and METEOR can outperform existing LLM-based judges in fine-grained harmfulness evaluation, challenging prevailing assumptions about LLMs' superiority in this domain. To reveal the reasons behind this finding, we provide a fine-grained analysis that explains the limitations of LLM-based judges in rating irrelevant or useless responses. Furthermore, we build a new harmfulness judge by incorporating the fine-grained criteria into its prompt template and leveraging reference-based metrics to fine-tune its base LLM. The resulting judge outperforms all existing metrics and judges in evaluating harmful responses.
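
The ranking-based scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's exact formula, which is not given here; it assumes that effectiveness is the fraction of (harmful, non-harmful) response pairs for which a metric assigns the harmful response a strictly higher score. The function name `effectiveness_score` and the treatment of ties as failures are hypothetical choices made for illustration.

```python
from typing import Sequence


def effectiveness_score(
    harmful_scores: Sequence[float],
    non_harmful_scores: Sequence[float],
) -> float:
    """Fraction of (harmful, non-harmful) pairs ranked correctly.

    Assumption: a pair counts as correct only when the harmful response
    receives a strictly higher score; ties count as failures.
    """
    if not harmful_scores or not non_harmful_scores:
        raise ValueError("both score lists must be non-empty")
    total = len(harmful_scores) * len(non_harmful_scores)
    wins = sum(
        1
        for h in harmful_scores
        for n in non_harmful_scores
        if h > n
    )
    return wins / total


# Scores from a continuous metric (e.g. a similarity score in [0, 1]).
continuous = effectiveness_score([0.9, 0.4], [0.1, 0.5, 0.3])

# Scores from a binary judge (1 = harmful, 0 = not harmful).
binary = effectiveness_score([1, 1], [1, 0])

print(continuous, binary)
```

Because only the relative order of scores matters in this sketch, the same function accepts binary labels, Likert-style ratings, or continuous similarity scores without rescaling, which is the property the abstract attributes to the benchmark's scoring mechanism.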

Paper Structure

This paper contains 23 sections, 11 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Dataset construction and evaluation pipeline of HarmMetric Eval. In the dataset construction stage, harmful prompts are collected from multiple data sources and combined with carefully designed prompt templates or various jailbreak attacks. Multiple LLMs are then used to generate diverse harmful model responses and non-harmful counterparts across fine-grained categories. In the evaluation stage, a flexible scoring mechanism compares the relative rankings of scores assigned by harmfulness metrics to harmful and non-harmful responses, producing both overall and fine-grained effectiveness scores for each metric. This pipeline enables comprehensive coverage of model behavior patterns and supports standardized, fine-grained assessment of nearly all existing harmfulness metrics and judges.
  • Figure 2: Model responses observed in real attack scenarios and their mapping to our non-harmful categories.
  • Figure 3: A failure case of a prompt-based LLM judge on HarmMetric Eval. For this harmful prompt, JailbreakBench Eval assigns a higher score to a useless affirmation response than to a genuinely harmful response, despite explicitly stating in its prompt template that harmful responses must “clearly provide specific details” and that affirmation-only answers without harmful content should be treated as non-harmful. This example reveals a potential weakness of LLM-based judges in overestimating the harmfulness of useless affirmation responses.
  • Figure 4: Overall effectiveness scores of ROUGE variants. Recall-based scoring and unigram-level similarity are more favorable for LLM harmfulness evaluation, while applying word stemming prior to similarity computation generally provides a modest improvement in evaluation effectiveness. In contrast, precision-based scoring and similarity measures that rely on contiguous token matching tend to negatively affect harmfulness evaluation performance.
  • Figure 5: Prompt template used by StrongReject Eval.
  • ...and 7 more figures
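
The Figure 4 caption contrasts recall-based and precision-based unigram scoring. The sketch below shows how the two differ on a toy example. It is a simplified illustration that assumes lowercase whitespace tokenization and no stemming; it is not the official ROUGE implementation, and the function name `rouge1_recall_precision` and the placeholder strings are hypothetical.

```python
from collections import Counter
from typing import Tuple


def rouge1_recall_precision(reference: str, candidate: str) -> Tuple[float, float]:
    """Unigram-overlap recall and precision (simplified ROUGE-1).

    Assumption: tokens are lowercase whitespace-separated words, no stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    return recall, precision


# Benign placeholder text standing in for a reference and a model response.
reference = "step one gather parts step two assemble"
candidate = (
    "sure here is a long answer step one gather parts "
    "then step two assemble them"
)

recall, precision = rouge1_recall_precision(reference, candidate)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

In this toy case the candidate covers every reference token, so recall is unaffected by the extra wording, while precision drops because of the added tokens. This is one plausible reading of why the caption reports recall-based scoring as more favorable when responses are long and varied.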