SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Mohammed Talha Alam, Nada Saadi, Fahad Shamshad, Nils Lukas, Karthik Nandakumar, Fakhri Karray, Samuele Poppi

TL;DR

SPQR introduces a unified, reproducible benchmark for safety alignment in text-to-image diffusion models by evaluating four dimensions—Safety, Prompt adherence, Quality, and Robustness—under benign fine-tuning. It formalizes an unintentional attacker model and uses a harmonic-mean aggregation to produce a single, comparable score, demonstrating that many safety methods degrade under realistic post-deployment adaptations, with LoRA-based fine-tuning often being relatively safer. Across multilingual and domain-specific settings, SPQR reveals distribution-aware methods (e.g., RECE, MACE, UCE) as more robust, while others exhibit brittle safety under post-training drift. The benchmark provides a practical tool for developers and researchers to assess and compare safety techniques in deployment-like conditions, encouraging methods that embed safety more deeply and remain stable when models evolve post-release.

Abstract

Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under the benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. Because true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety, Prompt adherence, Quality, Robustness). SPQR provides a standardized and reproducible framework for evaluating how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, and it reports a single leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, showcasing SPQR as a concise yet comprehensive benchmark for safety alignment techniques in text-to-image (T2I) models.
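
To make the single-score idea concrete, here is a minimal sketch of a harmonic-mean aggregation over the four axes. It assumes each axis score has already been normalized to [0, 1] and that the axes are weighted equally; the function name `spqr_score` and these normalization and weighting choices are illustrative assumptions, not the paper's exact formula.

```python
from statistics import harmonic_mean


def spqr_score(safety, prompt_adherence, quality, robustness):
    """Combine four axis scores into one number (hypothetical helper).

    Assumption: every score is already normalized to [0, 1] and all
    four axes carry equal weight. The paper's aggregation may differ.
    """
    scores = [safety, prompt_adherence, quality, robustness]
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each score must lie in [0, 1]")
    # The harmonic mean is dominated by the smallest value, so one weak
    # axis (e.g. low robustness) drags the overall score down sharply.
    return harmonic_mean(scores)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    balanced = spqr_score(0.8, 0.8, 0.8, 0.8)
    brittle = spqr_score(0.95, 0.9, 0.9, 0.2)
    print(f"balanced: {balanced:.3f}")
    print(f"brittle:  {brittle:.3f}")
```

Under these assumptions, a method that is strong on three axes but weak on one scores well below its arithmetic average, which is the behavior a harmonic mean is usually chosen for: it penalizes methods that look safe before adaptation but lose safety after benign fine-tuning.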

Paper Structure

This paper contains 28 sections, 9 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Illustration of the SPQR benchmark. (Left) Example of a benign fine-tuning (BFT) causing safety regression: before BFT, Stable Diffusion produces a safe image, while after BFT, the same prompt yields a harmful one. (Center) SPQR evaluates models along four axes, Safety (S), Prompt Adherence (P), Quality (Q), and Robustness (R), and aggregates them into a single harmonic-mean score. (Right) Representative comparison showing how different safety-alignment methods vary across dimensions, highlighting strong pre-adaptation safety but weak robustness after benign fine-tuning.
  • Figure 2: Qualitative Examples of Safety Failure After Benign Fine-Tuning. Models that were initially safe (top row) generate harmful or explicit outputs after benign fine-tuning (bottom row), revealing a breakdown of safety across methods (ESD, AdvUnlearn, STEREO, RECE).
  • Figure 3: Example visualization of the trade-off between prompt adherence and robustness to benign fine-tuning.
  • Figure 4: S-P-Q-R Performance Profiles for Key Methods. Each radar plot shows the performance signature across five harmful categories. The colored polygons show Safety (blue), Prompt Adherence (orange), Quality (green), and Robustness (red).
  • Figure A: The unintentional threat of benign fine-tuning. A simple, generally safe fine-tuning run can undermine the safety of current state-of-the-art safety-alignment techniques.
  • ...and 3 more figures