
The Path to Reconciling Quality and Safety in Text-to-Image Generation: Dataset, Method, and Evaluation

Shouwei Ruan, Zhenyu Wu, Yao Huang, Ruochen Zhang, Yitong Sun, Caixin Kang, Shiji Zhao, Xingxing Wei

TL;DR

This work tackles the enduring safety-quality trade-off in text-to-image generation by introducing a unified framework that integrates data, method, and evaluation. It builds LibraAlign-100K, the first large-scale dataset with dual safety and quality annotations, and derives LibraAlign-HF via safety-aware inpainting to produce high-fidelity safe examples, all underpinned by a Safety Cost Model. The core methodological contribution, Synergistic Preference Optimization (T2I-SPO), combines a quality reward with a safety penalty into a composite objective and supplements training with a Dynamic Focusing Mechanism to emphasize hard cases. Evaluation is unified through the Unified Alignment Score (UAScore), which jointly considers sample quality and safety, enabling fair comparisons across safety-alignment methods. Empirically, T2I-SPO achieves state-of-the-art safety across diverse NSFW concepts while preserving generation quality, and demonstrates robustness to adversarial prompts, establishing a principled baseline for safe and creative T2I deployment.
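The TL;DR describes the composite objective only at a high level. The sketch below shows one plausible reading of it. It assumes the composite reward is a quality score minus a $\lambda$-weighted safety cost. The $\lambda$ echoes Figure 4, but the exact functional form is an assumption. `Sample`, `composite_reward`, and `make_preference_pair` are hypothetical names. In practice the scores would come from a quality scorer and the Safety Cost Model, neither of which is modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Sample:
    """One generated image with two (assumed) scores."""
    image_id: str
    quality: float      # quality score; higher is assumed better
    safety_cost: float  # safety cost (e.g. from a cost model); higher is assumed less safe


def composite_reward(s: Sample, lam: float) -> float:
    """Hypothetical composite reward: quality minus a lam-weighted safety penalty."""
    return s.quality - lam * s.safety_cost


def make_preference_pair(a: Sample, b: Sample, lam: float) -> tuple[Sample, Sample]:
    """Order two samples for the same prompt as (winner, loser) by composite reward."""
    if composite_reward(a, lam) >= composite_reward(b, lam):
        return a, b
    return b, a


# Illustrative scores only: "a" scores higher on quality but is unsafe; "b" is safer.
a = Sample("a", quality=0.9, safety_cost=0.8)
b = Sample("b", quality=0.6, safety_cost=0.1)
print(make_preference_pair(a, b, lam=0.0)[0].image_id)  # quality only -> a
print(make_preference_pair(a, b, lam=1.0)[0].image_id)  # with safety penalty -> b
```

The toy numbers show that raising $\lambda$ can flip which image is preferred. This is consistent with $\lambda$ acting as a safety/quality knob, as in the sweep of Figure 4(A).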

Abstract

Content safety is a fundamental challenge for text-to-image (T2I) models, yet prevailing methods enforce a debilitating trade-off between safety and generation quality. We argue that mitigating this trade-off hinges on addressing systemic challenges in current T2I safety alignment across data, methods, and evaluation protocols. To this end, we introduce a unified framework for synergistic safety alignment. First, to overcome the flawed data paradigm that provides biased optimization signals, we develop LibraAlign-100K, the first large-scale dataset with dual annotations for safety and quality. Second, to address the myopic optimization of existing methods that focus solely on the safety reward, we propose Synergistic Preference Optimization (T2I-SPO), a novel alignment algorithm that extends the DPO paradigm with a composite reward function that integrates generation safety and quality to holistically model user preferences. Finally, to overcome the limitations of quality-agnostic and binary evaluation in current protocols, we introduce the Unified Alignment Score, a holistic, fine-grained metric that fairly quantifies the balance between safety and generative capability. Extensive experiments demonstrate that T2I-SPO achieves state-of-the-art safety alignment against a wide range of NSFW concepts, while better maintaining the model's generation quality and general capability.
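The abstract states that T2I-SPO extends the DPO paradigm. For reference, here is a minimal sketch of the standard DPO objective for one preference pair. It assumes the preferred (winner) and dispreferred (loser) samples have already been ordered, for instance by a composite safety/quality reward.

- **Inputs:** the log-likelihood arguments are taken as given. For diffusion models they are usually approximated through denoising errors, as in Diffusion-DPO, which this sketch does not implement.
- **`beta`:** the usual DPO temperature. Its default of 0.1 is illustrative, not a value from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (winner, loser) pair.

    logp_* are log-likelihoods under the model being trained; ref_logp_* are
    under the frozen reference model. Returns -log(sigmoid(margin)).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return softplus(-margin)


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Policy shifts probability toward the winner: the loss drops below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```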

Paper Structure

This paper contains 17 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Qualitative results. (Upper) The proposed T2I-SPO generates aesthetically pleasing, instruction-faithful images while removing NSFW content. (Lower) A comparison with the state-of-the-art baseline, AlignGuard liu2025alignguard, on SD-v1.5 and SDXL. T2I-SPO demonstrates superior robustness against diverse harmful concepts while substantially preserving generation quality.
  • Figure 2: An overview of our framework for T2I safety alignment that reconciles safety and quality. (Left) Data: We construct LibraAlign-100K, the first dataset with dual annotations for generation quality and a fine-grained safety cost, the latter provided by our proposed Safety Cost Model. (Center) Method: Our T2I-SPO algorithm optimizes a composite reward function to synergistically balance safety and quality, while a Dynamic Focusing Mechanism enhances learning on hard examples. (Right) Evaluation: We propose the UAScore, a holistic metric that integrates safety cost and quality scores for a fair and comprehensive assessment of the alignment trade-off.
  • Figure 3: Creation process of image pairs in LibraAlign-HF.
  • Figure 4: (A) IP and CLIPScore curves obtained by applying the composite reward function with different values of $\lambda$. (B) Demonstration of T2I-SPO's robustness on adversarial prompt benchmarks.
  • Figure 5: Ablation analysis of the DFM. (Left) Training loss curves for different values of $\eta$ with DFM, and without DFM (i.e., $\eta=0$). (Right) The contribution of the number of augmentation components within DFM. We report the converged loss when a number of augmentation types are randomly ablated from the DFM.
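Figure 5 indicates two things about the Dynamic Focusing Mechanism. It is controlled by a coefficient $\eta$, with $\eta=0$ equivalent to training without DFM, and it includes several augmentation components. Its exact form is not given here, so the following is only a hypothetical stand-in: a focal-loss-style reweighting of per-pair DPO losses.

- It up-weights hard pairs (negative preference margins) relative to easy ones.
- It reduces to uniform weighting at $\eta=0$.
- It omits the augmentation components entirely and should not be read as the paper's DFM.

`focus_weights` and `focused_dpo_loss` are illustrative names.

```python
from __future__ import annotations

import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def focus_weights(margins: list[float], eta: float) -> list[float]:
    """Hypothetical focal-style weights: (1 - sigmoid(margin)) ** eta.

    With eta > 0, pairs the model already ranks correctly (large positive
    margin) get weights near 0, while hard pairs (negative margin) keep
    weights closer to 1. With eta = 0 every weight is 1 (no focusing).
    """
    return [(1.0 - sigmoid(m)) ** eta for m in margins]


def focused_dpo_loss(margins: list[float], eta: float) -> float:
    """Mean of per-pair DPO losses -log(sigmoid(margin)), reweighted by focus_weights."""
    weights = focus_weights(margins, eta)
    losses = [softplus(-m) for m in margins]
    return sum(w * l for w, l in zip(weights, losses)) / len(margins)
```

Here the margins are the scaled DPO margins $\beta[(\log\pi - \log\pi_{\text{ref}})_w - (\log\pi - \log\pi_{\text{ref}})_l]$ for each pair. Increasing $\eta$ shifts more of the gradient signal toward pairs the model still gets wrong.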