Sensitive Image Classification by Vision Transformers
Hanxian He, Campbell Wilson, Thanh Thi Nguyen, Janis Dalins
TL;DR
This work tackles pornographic image classification, including CSAM-related content, by using vision transformers to capture global context across image patches. It evaluates ViT, DeiT, Swin, and LITv2 models against a ResNet18 baseline on three datasets (2-class P2, 3-class P2 with a porn-indicative category, and ACI) and highlights the benefits of pretraining on ImageNet-1K. The transformer-based models generally outperform the ResNet18 baseline. LITv2 achieves the best balance between local and global attention and performs strongly on the 3-class dataset under certain configurations, although the porn-indicative class remains challenging. The study underscores the importance of diverse training data and attention mechanisms for sensitive content detection, and points to future work on transfer learning, ethically compliant datasets, and broader visualization to address practical deployment considerations.
Abstract
Classifying child sexual abuse images is challenging because the classes exhibit similar inter-class correlations and diverse intra-class correlations. Unlike conventional deep convolutional networks, vision transformer models use a self-attention mechanism to capture global interactions among contextual local elements. This allows them to relate image patches to one another effectively, avoiding incorrect correlations and reducing ambiguity in attention maps, which has made them effective in computer vision tasks. Rather than directly analyzing child sexual abuse data, we constructed two datasets from Reddit and Google Open Images data: one comprising clean and pornographic images, and another with three classes that additionally includes images indicative of pornography. We also employed an adult content image benchmark dataset in our experiments. These datasets served as the basis for assessing the performance of vision transformer models in pornographic image classification. We compared several popular vision transformer models with traditional pre-trained ResNet models, and with established methods for sensitive image detection such as an attention- and metric-learning-based CNN and Bumble. The findings showed that vision transformer networks surpassed the benchmark pre-trained models in classification and detection on this task.
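The self-attention idea mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the implementation used in this work: the toy 16x16 grayscale image, the patch size of 4, the projection width `d_model`, and the helper names (`patchify`, `matmul`, `softmax`, `self_attention`, `rand_matrix`) are all assumptions chosen for illustration, and the random weights stand in for learned parameters. It shows only how every patch attends to every other patch in a single attention head.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys
    scores = matmul(q, k_t)               # patch-to-patch similarities
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def rand_matrix(rows, cols):
    """Hypothetical stand-in for a learned projection matrix."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
image = [[random.random() for _ in range(16)] for _ in range(16)]  # toy input
tokens = patchify(image, patch=4)   # 16 patches, 16 pixel values each
d_model = 8                         # assumed projection width
w_q, w_k, w_v = (rand_matrix(len(tokens[0]), d_model) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over all 16 patches, so the output for one patch can draw on any other region of the image. This is the global interaction that the abstract contrasts with the local receptive fields of convolutional networks.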
