Self-supervised visual learning for analyzing firearms trafficking activities on the Web

Sotirios Konstantakos, Despina Ioanna Chalkiadaki, Ioannis Mademlis, Adamantia Anna Rebolledo Chrysochoou, Georgios Th. Papadopoulos

TL;DR

This paper addresses the challenge of automatic firearms classification from web-sourced RGB images for open-source intelligence by evaluating self-supervised pretraining methods and a mixed pretraining scheme. It compares four SSL algorithms (SimCLR, DINO, MAE, DeepClusterV2) against supervised pretraining on ImageNet variants, using both Vision Transformer (ViT) and ResNet-50 backbones, and introduces a mixed SSL-supervised approach. The authors validate on CrawledFirearmsRGB, a 25k-image, 23-class dataset reflecting real-world web content, finding that SSL pretraining can yield substantial gains, with DINO (ViT) and SimCLR (ResNet-50) often delivering the best downstream accuracy, and that SSL often outperforms large-scale supervised pretraining on ImageNet-1k in this domain. The work demonstrates SSL’s potential to reduce data requirements for domain-specific firearm classification and highlights ViT-specific gains when paired with appropriate SSL pretraining, contributing both methodological insights and a new dataset for OSINT workflows.

Abstract

Automated visual firearms classification from RGB images is an important real-world task with applications in public space security, intelligence gathering and law enforcement investigations. When applied to images massively crawled from the World Wide Web (including social media and dark Web sites), it can serve as an important component of systems that attempt to identify criminal firearms trafficking networks, by analyzing Big Data from open-source intelligence. Deep Neural Networks (DNN) are the state-of-the-art methodology for achieving this, with Convolutional Neural Networks (CNN) being typically employed. The common transfer learning approach consists of pretraining on a large-scale, generic annotated dataset for whole-image classification, such as ImageNet-1k, and then finetuning the DNN on a smaller, annotated, task-specific, downstream dataset for visual firearms classification. Neither Vision Transformer (ViT) neural architectures nor Self-Supervised Learning (SSL) approaches have so far been evaluated on this critical task.

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, 5 tables, 4 algorithms.

Figures (2)

  • Figure 1: Sample images from the dataset in perez2020object.
  • Figure 2: Confusion matrix of DINO on ViT (single-dataset setup).