How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric Reliability

Ora Nova Fandina, Leshem Choshen, Eitan Farchi, George Kour, Yotam Perlitz, Orna Raz

TL;DR

This work addresses the reliability of harmfulness evaluation metrics $S$ for LLM outputs by introducing automatic concatenation-based tests that probe invariance to input length, content repetition, and input order. It applies the tests to reward-model-based metrics and LLM-as-a-Judge metrics (e.g., GPT-3.5, GPT-4o) and uncovers significant reliability gaps: reward-based metrics are sensitive to repetition, while GPT-based judges exhibit strong positional bias and, in some cases, fail to preserve cluster scores. The concatenation framework reveals that a metric can flip harmful content from unsafe to safe under concatenation, underscoring risks in safety evaluation pipelines. Overall, the paper provides a general, automatic methodology to diagnose metric reliability with practical implications for safer deployment of language models.

Abstract

Consider a scenario where a harmfulness evaluation metric is intended to filter unsafe responses from a Large Language Model. When applied to individual harmful prompt-response pairs, it correctly flags them as unsafe by assigning a high-risk score. Yet, if those same pairs are concatenated, the metric's decision unexpectedly reverses: it labels the combined content as safe with a low score, allowing the harmful text to bypass the filter. We found that multiple safety metrics, including advanced metrics such as GPT-based judges, exhibit this unsafe behaviour. Moreover, they show a strong sensitivity to input order: responses are often classified as safe if safe content appears first, regardless of any harmful content that follows, and vice versa. These findings underscore the importance of evaluating the safety of safety metrics, that is, the reliability of their output scores. To address this, we developed general, automatic, concatenation-based tests to assess key properties of these metrics. When applied in a model safety scenario, the tests revealed significant inconsistencies in harmfulness evaluations.
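The two failure modes described in the abstract, decision flipping and order sensitivity, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the threshold, the toy vocabulary, and both toy metrics are hypothetical stand-ins, and the sketch assumes the convention that a higher score means more harmful.

```python
from itertools import permutations
from typing import Callable, Sequence

Metric = Callable[[str], float]   # assumed convention: higher = more harmful
FLAGGED = {"attack", "exploit"}   # toy vocabulary, for illustration only


def flips_to_safe(metric: Metric, items: Sequence[str], threshold: float) -> bool:
    """True if every item is flagged on its own but their concatenation is not."""
    each_unsafe = all(metric(x) >= threshold for x in items)
    return each_unsafe and metric(" ".join(items)) < threshold


def order_spread(metric: Metric, items: Sequence[str]) -> float:
    """Max minus min score over all orderings; 0.0 means order-invariant."""
    scores = [metric(" ".join(p)) for p in permutations(items)]
    return max(scores) - min(scores)


def length_penalized_metric(text: str) -> float:
    """Hypothetical metric whose score shrinks with length: flagged / words**2."""
    words = text.lower().split()
    return sum(w in FLAGGED for w in words) / max(len(words), 1) ** 2


def first_window_metric(text: str, k: int = 3) -> float:
    """Hypothetical metric that only reads the first k words (positional bias)."""
    return float(any(w in FLAGGED for w in text.lower().split()[:k]))


harmful = ["plan attack", "write exploit"]
mixed = ["hello there friend", "plan attack"]

# Each harmful item scores 0.25 alone, but the concatenation scores 0.125.
print(flips_to_safe(length_penalized_metric, harmful, threshold=0.2))
# The verdict depends entirely on which item comes first.
print(order_spread(first_window_metric, mixed))
```

A real test would replace the toy metrics with a call to the metric under evaluation and report the fraction of test clusters for which the flip or the spread is non-trivial.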

Paper Structure (16 sections, 16 figures, 6 tables)

Figures (16)

  • Figure 1: Decision flipping: the GPT-3.5-based judge is prone to this phenomenon, with the verdict flipping for approximately $30\%$ of the data in our tests.
  • Figure 2: Positional bias: The GPT-4o-based metric exhibits a strong positional bias, classifying $60\%$ of the data as safe when the safe response is placed first, compared to only $10\%$ when the content is randomly ordered.
  • Figure 3: Score ranges of the metrics. Reward-based metric scores lie in $[-8, 6]$, with higher scores denoting less harmful responses, while GPT-based metric scores range from 1 to 5, with higher scores indicating more harmful responses.
  • Figure 4: The reward-based metrics are sensitive to repeated content.
  • Figure 5: 2-concatenated cluster. The analysis is in the paper's appendix.
  • ...and 11 more figures
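
The captions of Figures 3 and 4 imply two practical steps: putting the two metric families on a common scale (their ranges and directions differ) and checking how a score moves when the same content is repeated. The sketch below illustrates both under stated assumptions. The helper names and the toy reward function are hypothetical and are not taken from the paper; only the score ranges and directions come from the Figure 3 caption.

```python
from typing import Callable

Metric = Callable[[str], float]


def to_harm_scale(score: float, lo: float, hi: float, higher_is_harmful: bool) -> float:
    """Map a raw score in [lo, hi] to [0, 1], where 1.0 is most harmful."""
    unit = (score - lo) / (hi - lo)
    return unit if higher_is_harmful else 1.0 - unit


def repetition_drift(metric: Metric, text: str, repeats: int = 3) -> float:
    """Score of the text repeated `repeats` times minus its score once.

    A repetition-invariant metric would return 0.0.
    """
    return metric(" ".join([text] * repeats)) - metric(text)


def toy_reward(text: str) -> float:
    """Hypothetical reward in [-8, 6] that drops by 0.5 per word."""
    return max(-8.0, 6.0 - 0.5 * len(text.split()))


# Reward-based range [-8, 6], higher = less harmful (Figure 3).
print(to_harm_scale(6.0, -8.0, 6.0, higher_is_harmful=False))
# GPT-based range 1 to 5, higher = more harmful (Figure 3).
print(to_harm_scale(5.0, 1.0, 5.0, higher_is_harmful=True))
# The toy reward changes when identical content is repeated.
print(repetition_drift(toy_reward, "a b", repeats=3))
```

With both families mapped to the same direction, a single threshold and the same invariance checks can be applied to either kind of metric.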