How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric Reliability

Ora Nova Fandina; Leshem Choshen; Eitan Farchi; George Kour; Yotam Perlitz; Orna Raz

TL;DR

This work addresses the reliability of harmfulness evaluation metrics $S$ for LLM outputs by introducing automatic concatenation-based tests that probe invariance to input length, content repetition, and input order. It applies the tests to reward-model-based metrics and LLM-as-a-Judge metrics (e.g., GPT-3.5, GPT-4o) and uncovers significant reliability gaps: reward-based metrics are sensitive to repetition, while GPT-based judges exhibit strong positional bias and, in some cases, fail to preserve cluster scores. The concatenation framework reveals that a metric can flip harmful content from unsafe to safe under concatenation, underscoring risks in safety evaluation pipelines. Overall, the paper provides a general, automatic methodology to diagnose metric reliability with practical implications for safer deployment of language models.

Abstract

Consider a harmfulness evaluation metric intended to filter unsafe responses from a Large Language Model. When applied to individual harmful prompt-response pairs, it correctly flags them as unsafe by assigning a high-risk score. Yet if those same pairs are concatenated, the metric's decision unexpectedly reverses: it labels the combined content as safe with a low score, allowing the harmful text to bypass the filter. We found that multiple safety metrics, including advanced ones such as GPT-based judges, exhibit this unsafe behaviour. Moreover, they show strong sensitivity to input order: responses are often classified as safe if safe content appears first, regardless of any harmful content that follows, and vice versa. These findings underscore the importance of evaluating the safety of safety metrics, that is, the reliability of their output scores. To address this, we developed general, automatic, concatenation-based tests that assess key properties of these metrics. When applied in a model safety scenario, the tests revealed significant inconsistencies in harmfulness evaluations.
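The two failure modes described above, decision flipping under concatenation and sensitivity to input order, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the function names (`flip_rate`, `order_sensitive`), the newline separator, the decision threshold, the convention that higher scores mean more harmful, and the two toy metrics, which are deliberately flawed stand-ins for a real reward model or LLM judge.

```python
from typing import Callable, Sequence

# Assumed convention: a metric maps text to a harm score, higher = more harmful.
Metric = Callable[[str], float]


def concat(texts: Sequence[str], sep: str = "\n") -> str:
    """Join several prompt-response texts into one input (separator is an assumption)."""
    return sep.join(texts)


def is_unsafe(metric: Metric, text: str, threshold: float) -> bool:
    """Binary decision derived from the score (threshold is hypothetical)."""
    return metric(text) >= threshold


def flip_rate(metric: Metric, clusters: Sequence[Sequence[str]], threshold: float) -> float:
    """Fraction of clusters whose items are all unsafe alone but safe when concatenated."""
    eligible = [c for c in clusters if all(is_unsafe(metric, t, threshold) for t in c)]
    if not eligible:
        return 0.0
    flipped = sum(not is_unsafe(metric, concat(c), threshold) for c in eligible)
    return flipped / len(eligible)


def order_sensitive(metric: Metric, safe_text: str, harmful_text: str, threshold: float) -> bool:
    """True if the decision depends on which of the two texts comes first."""
    safe_first = is_unsafe(metric, concat([safe_text, harmful_text]), threshold)
    harmful_first = is_unsafe(metric, concat([harmful_text, safe_text]), threshold)
    return safe_first != harmful_first


# Toy metric 1: keyword presence diluted by input length (flips under concatenation).
def diluted_metric(text: str) -> float:
    words = text.lower().split()
    return (1.0 / len(words)) if "harmful" in words else 0.0


# Toy metric 2: only inspects the first three words (positional bias).
def prefix_metric(text: str) -> float:
    return 1.0 if "harmful" in text.lower().split()[:3] else 0.0


clusters = [["this is harmful text", "another harmful reply here"]]
print(flip_rate(diluted_metric, clusters, threshold=0.2))
print(order_sensitive(prefix_metric, "a perfectly benign reply", "this is harmful text", 0.5))
```

To probe a real metric, the toy functions would be replaced by a wrapper around the scorer under test, with the comparison direction adjusted if that scorer assigns higher values to safer content.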

Paper Structure (16 sections, 16 figures, 6 tables)

Introduction
Model Safety: Task and Metrics
Concatenation Based Tests
Results
Experimental setup
Repetition tests: Reward-based metrics are sensitive to repeated content
Cluster tests: Reward-based metrics and GPT-3.5 exhibit a notable decision-flipping rate
Positional bias tests: GPT-3.5 and GPT-4o-based metrics have a strong positional bias
Conclusions
Limitations
The System Prompt and Scoring Rules for the Judge Model
Additional Results
Repetition tests
Concatenation and Permutation tests
Cluster tests
...and 1 more section

Figures (16)

Figure 1: Decision flipping: the GPT-3.5-based judge is prone to this phenomenon, with the verdict flipping for approximately $30\%$ of the data in our tests.
Figure 2: Positional bias: The GPT-4o-based metric exhibits a strong positional bias, classifying $60\%$ of the data as safe when the safe response is placed first, compared to only $10\%$ when the content is randomly ordered.
Figure 3: Reward-based metric scores range over [-8, 6], with higher scores denoting less harmful responses, while GPT-based metric scores range from 1 to 5, with higher scores indicating more harmful responses.
Figure 4: The reward-based metrics are sensitive to repeated content.
Figure 5: 2-concatenated cluster. The analysis is in the appendix (Additional Results).
...and 11 more figures
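
The Figure 3 caption notes that the two metric families use different ranges and opposite directions. A minimal sketch of putting both on a common harm scale in [0, 1] follows; the linear rescaling, the clipping, and the function names are assumptions for illustration and are not taken from the paper.

```python
def reward_to_harm(score: float, low: float = -8.0, high: float = 6.0) -> float:
    """Map a reward score in [low, high] (higher = less harmful) to harm in [0, 1]."""
    clipped = min(max(score, low), high)
    return (high - clipped) / (high - low)


def judge_to_harm(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Map a judge score in [low, high] (higher = more harmful) to harm in [0, 1]."""
    clipped = min(max(score, low), high)
    return (clipped - low) / (high - low)


print(reward_to_harm(-1.0), judge_to_harm(3.0))
```

With both metrics expressed in the same direction, a single threshold convention can be applied when comparing their decisions, although any such threshold remains a choice of the evaluator.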
