UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images

Yiting Qu; Xinyue Shen; Yixin Wu; Michael Backes; Savvas Zannettou; Yang Zhang

TL;DR

UnsafeBench benchmarks image safety classifiers on both real-world and AI-generated images, revealing a distribution shift between the two and robustness gaps in existing classifiers. The framework builds a dataset of roughly 10K images annotated against 11 unsafe categories and evaluates five conventional classifiers plus three VLM-based classifiers; GPT-4V is the most effective, though constrained by cost and prompts. The authors introduce PerspectiveVision, a moderation tool built on a LoRA-fine-tuned LLaVA, which achieves state-of-the-art F1 and strong out-of-distribution generalization, especially on AI-generated images, and is more robust against adversarial attacks. The work highlights the need for AI-aware training and large foundation-model ensembles for reliable moderation in the era of generative AI, and provides an open-source baseline for future research.

Abstract

With the advent of text-to-image models and concerns about their misuse, developers are increasingly relying on image safety classifiers to moderate their generated unsafe images. Yet, the performance of current image safety classifiers remains unknown for both real-world and AI-generated images. In this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers, with a particular focus on the impact of AI-generated images on their performance. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe based on a set of 11 unsafe categories of images (sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers that are powered by general-purpose visual language models. Our assessment indicates that existing image safety classifiers are not comprehensive and effective enough to mitigate the multifaceted problem of unsafe images. Also, there exists a distribution shift between real-world and AI-generated images in image qualities, styles, and layouts, leading to degraded effectiveness and robustness. Motivated by these findings, we build a comprehensive image moderation tool called PerspectiveVision, which improves the effectiveness and robustness of existing classifiers, especially on AI-generated images. UnsafeBench and PerspectiveVision can aid the research community in better understanding the landscape of image safety classification in the era of generative AI.
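The effectiveness comparison described above, F1-Score reported separately for real-world and AI-generated images, can be illustrated with a minimal sketch. It assumes binary labels (1 = unsafe, 0 = safe) and a hypothetical record layout of `(source, label, prediction)`; the function names and toy records are made up for illustration and are neither the authors' evaluation code nor results from the paper.

```python
from collections import defaultdict


def f1_score(y_true, y_pred):
    """Binary F1 with 'unsafe' (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_source(records):
    """Group (source, label, prediction) records by source and score each group."""
    groups = defaultdict(lambda: ([], []))
    for source, label, pred in records:
        groups[source][0].append(label)
        groups[source][1].append(pred)
    return {src: f1_score(labels, preds) for src, (labels, preds) in groups.items()}


# Toy records, invented for illustration only (not data from the benchmark).
records = [
    ("real", 1, 1), ("real", 1, 1), ("real", 0, 0), ("real", 0, 1),
    ("ai", 1, 1), ("ai", 1, 0), ("ai", 0, 0), ("ai", 0, 0),
]
scores = f1_by_source(records)
print(scores)
```

A gap between the two per-source scores is the kind of signal the benchmark looks for when comparing a classifier's behavior on real-world and AI-generated images.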

Paper Structure (30 sections, 5 equations, 16 figures, 12 tables)

Introduction
Background
Unsafe Image Taxonomy
Overview of UnsafeBench
UnsafeBench Dataset
Image Classifier Collection
Aligning Classifier Coverage With Unsafe Categories
Effectiveness Assessment
Methodology
Effectiveness Result
Why are Certain Classifiers Less Effective on AI-Generated Images?
A Case Study on Artistic Representation and Grid Layout
Takeaways
Robustness Assessment
Methodology
...and 15 more sections

Figures (16)

Figure 1: High-level overview of UnsafeBench.
Figure 2: Average F1-Score and number of classifiers for each unsafe category.
Figure 3: Average F1-Score of classifiers on real-world and AI-generated images.
Figure 4: Image clusters from the Sexual category that are misclassified by SD_Filter, NSFW_Detector, and NudeNet. We annotate each central image with its cluster ID and cluster size. We blur sexual images for censoring purposes.
Figure 5: The original real-world image and its AI-generated variations applying the artistic style and grid layout. The original image is unsafe and correctly predicted by Q16. Text in red indicates that image variations are misclassified as safe.
...and 11 more figures
