Sensitive Image Classification by Vision Transformers

Hanxian He, Campbell Wilson, Thanh Thi Nguyen, Janis Dalins

TL;DR

This work tackles the challenge of pornographic image classification, including CSAM-related content, by leveraging vision transformers to capture global context across image patches. It evaluates ViT, DeiT, Swin, and LITv2 models against a ResNet18 baseline on three datasets (2-class P2, 3-class P2 with a porn-indicative category, and ACI), highlighting the benefits of pretraining on ImageNet-1K. The results show that transformer-based models generally outperform the ResNet18 baseline, with LITv2 achieving the best balance between local and global attention and strong performance on the 3-class dataset under certain configurations, while the porn-indicative class remains challenging. The study underscores the importance of diverse training data and attention mechanisms for sensitive content detection and points to future work in transfer learning, ethically compliant datasets, and broader visualization to address practical deployment considerations.

Abstract

In classifying child sexual abuse images, handling high inter-class similarity together with high intra-class diversity poses a significant challenge. Vision transformer models, unlike conventional deep convolutional network models, leverage a self-attention mechanism to capture global interactions among contextual local elements. This allows them to navigate through image patches effectively, avoiding incorrect correlations and reducing ambiguity in attention maps, which supports their efficacy in computer vision tasks. Rather than directly analyzing child sexual abuse data, we constructed two datasets sourced from Reddit and Google Open Images data: one comprising clean and pornographic images, and another with three classes that additionally includes images indicative of pornography. In our experiments, we also employ an adult content image benchmark dataset. These datasets served as a basis for assessing the performance of vision transformer models in pornographic image classification. We conducted a comparative analysis between several popular vision transformer models and traditional pre-trained ResNet models. Furthermore, we compared them with established methods for sensitive image detection, such as an attention- and metric-learning-based CNN and Bumble. The findings demonstrated that vision transformer networks surpassed the benchmark pre-trained models, showcasing their superior classification and detection capabilities in this task.

Paper Structure

This paper contains 15 sections, 6 figures, 10 tables.

Figures (6)

  • Figure 1: A structure of the ViT transformer module (redrawn from dosovitskiy2020image).
  • Figure 2: A structure of the Swin transformer model (redrawn from liu2021swin). With a patch partitioning scheme of size 4 $\times$ 4, each patch of the RGB input image yields a feature vector of size 48 (4 $\times$ 4 $\times$ 3). These vectors are passed through a linear embedding layer and then processed by a Swin transformer block, resulting in a feature layer with dimension C.
  • Figure 3: The structure of the Hi-Lo attention model LITv2 (redrawn from pan_fast_2023). $H_h$ represents the total number of attention heads. $\alpha$ represents the ratio of low-frequency attention heads to total attention heads.
  • Figure 4: Image classification accuracy with different models on the P2 3-class validation dataset with all models trained from scratch.
  • Figure 5: Per-class image classification accuracy with different models on the P2 3-class dataset.
  • ...and 1 more figures
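
The arithmetic in the Figure 2 and Figure 3 captions can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`patch_partition`, `linear_embed`, `split_heads`), the 8 $\times$ 8 toy image, the embedding width of 6, and the rounding rule for the head split are all hypothetical choices made for the example.

```python
def patch_partition(image, patch=4):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])  # append the pixel's channels
            tokens.append(token)
    return tokens


def linear_embed(tokens, weight):
    """Project each token of length D_in with a D_in x C weight matrix."""
    columns = list(zip(*weight))  # C columns, each of length D_in
    return [[sum(t * w for t, w in zip(token, column)) for column in columns]
            for token in tokens]


def split_heads(num_heads, alpha):
    """Split heads into (low-frequency, high-frequency) counts.

    alpha is the ratio of low-frequency heads to total heads, as in the
    Figure 3 caption. Rounding to the nearest integer is an assumption.
    """
    low = round(alpha * num_heads)
    return low, num_heads - low


# Toy 8 x 8 RGB image: each pixel stores [row, col, 0].
image = [[[r, c, 0] for c in range(8)] for r in range(8)]
tokens = patch_partition(image, patch=4)

# Hypothetical embedding width C = 6, with a constant weight matrix.
embed_dim = 6
weight = [[1.0] * embed_dim for _ in range(len(tokens[0]))]
embedded = linear_embed(tokens, weight)

print(len(tokens), len(tokens[0]))      # number of patches, raw feature size
print(len(embedded), len(embedded[0]))  # number of patches, embedded size C
print(split_heads(12, 0.5))             # low- and high-frequency head counts
```

Each 4 $\times$ 4 patch of a 3-channel image flattens to 4 $\times$ 4 $\times$ 3 = 48 values, which is where the size 48 in the caption comes from; the linear embedding then maps those 48 values to C features per patch.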