
Leveraging Self-Supervised Learning for Scene Classification in Child Sexual Abuse Imagery

Pedro H. V. Valois, João Macedo, Leo S. F. Ribeiro, Jefersson A. dos Santos, Sandra Avila

TL;DR

This paper addresses the automated analysis of child sexual abuse imagery (CSAI) by proposing indoor scene classification as a scalable proxy task and evaluating self-supervised learning (SSL) models trained on scene-centric data. It systematically compares object-centric and scene-centric SSL pretraining, including synthetic indoor data, to optimize downstream Places8 performance, achieving 71.6% balanced accuracy with a Barlow Twins-based protocol. The approach is then tested on out-of-distribution scenes and, in collaboration with the Brazilian Federal Police, on real CSAI data, revealing a substantial domain gap and limited efficacy of indoor-scene cues alone for CSAI classification. The findings show that SSL can improve indoor scene understanding for CSAI triage, but they underscore the need for people-aware cues and cross-domain generalization to responsibly support law enforcement while mitigating biases.

Abstract

Crime in the 21st century is split into a virtual and a real world. However, the former has become a global menace to people's well-being and security in the latter. The challenges it presents must be faced with unified global cooperation, and we must rely more than ever on automated yet trustworthy tools to combat the ever-growing volume of online offenses. Over 10 million child sexual abuse reports are submitted to the US National Center for Missing & Exploited Children every year, and over 80% originate from online sources. Therefore, investigation centers cannot manually process and correctly investigate all imagery. In light of that, reliable automated tools that can securely and efficiently deal with this data are paramount. In this sense, the scene classification task looks for contextual cues in the environment and can group and classify child sexual abuse data without requiring training on sensitive material. The scarcity of child sexual abuse images and the limitations of working with them motivate self-supervised learning, a machine-learning methodology that leverages unlabeled data to produce powerful representations that can be more easily transferred to downstream tasks. This work shows that self-supervised deep learning models pre-trained on scene-centric data can reach 71.6% balanced accuracy on our indoor scene classification task and, on average, perform 2.2 percentage points better than a fully supervised version. We cooperate with Brazilian Federal Police experts to evaluate our indoor classification model on actual child abuse material. The results demonstrate a notable discrepancy between the features observed in widely used scene datasets and those depicted in sensitive material.
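The headline metric above, balanced accuracy, is the unweighted mean of per-class recall, so every scene class counts equally regardless of how many images it has. The following is a minimal Python sketch of that standard definition; the function name and the toy labels are illustrative assumptions and are not taken from the paper or its data.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical toy labels (not from the paper): an imbalanced two-class case.
y_true = ["bedroom", "bedroom", "bedroom", "bedroom", "bathroom", "bathroom"]
y_pred = ["bedroom", "bedroom", "bedroom", "bathroom", "bathroom", "bedroom"]

score = balanced_accuracy(y_true, y_pred)  # (3/4 + 1/2) / 2
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 4/6
```

In this toy case the plain accuracy (4 of 6 correct) is higher than the balanced accuracy because the larger class is classified better; averaging recall per class removes that advantage.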

Paper Structure (19 sections, 5 figures, 4 tables)

This paper contains 19 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Methodology pipeline. Self-supervised methods are often used in a common two-stage training protocol: (1) the pretext task, or pre-training stage, and (2) the downstream task, or fine-tuning stage. The pretext task runs the SSL technique on unlabeled data, while the downstream task uses labeled data. Our downstream stage is fixed, so we consider the pretext task through the lens of the three research questions depicted (see the Methodology section).
  • Figure 2: Box-plots of balanced accuracy versus the different techniques (SSL and supervised) fine-tuned on Places8.
  • Figure 3: Inference results on the OOD Scenes. Each row represents one label, and the name on the left is the true label for each image. A check mark (✓) highlights if the prediction matches the true label; otherwise, the predicted label is placed below the image.
  • Figure 4: Histogram of ground-truth labeled scenes within CSAI and Suspected CSAI categories.
  • Figure 5: Confusion matrix (%) for Places8 scenes classified in the CSAI dataset with values normalized by the number of elements in each class. "Classroom" and "dressing room" scenes are not present in this dataset.
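The normalization described for Figure 5 (each row divided by the number of elements in its true class, expressed in percent) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the function name, the three-label list, and the toy predictions are hypothetical, and classes with no samples are returned as `None` to mirror the note that some scenes are absent from the dataset.

```python
def row_normalized_confusion(y_true, y_pred, labels):
    """Confusion matrix in percent, each row normalized by its true-class count.

    Rows follow the order of `labels`; a class with no true samples yields None.
    """
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1

    percent = []
    for row in counts:
        n = sum(row)
        # Avoid dividing by zero for classes absent from the ground truth.
        percent.append([100.0 * c / n for c in row] if n else None)
    return percent

# Hypothetical labels and predictions (not from the paper).
labels = ["bedroom", "bathroom", "classroom"]
y_true = ["bedroom", "bedroom", "bedroom", "bedroom", "bathroom", "bathroom"]
y_pred = ["bedroom", "bedroom", "bedroom", "classroom", "bathroom", "bedroom"]

matrix = row_normalized_confusion(y_true, y_pred, labels)
```

Each non-empty row sums to 100, so the diagonal entry of a row is that class's recall in percent.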