Self-supervised visual learning for analyzing firearms trafficking activities on the Web
Sotirios Konstantakos, Despina Ioanna Chalkiadaki, Ioannis Mademlis, Adamantia Anna Rebolledo Chrysochoou, Georgios Th. Papadopoulos
TL;DR
This paper addresses the challenge of automatic firearms classification from web-sourced RGB images for open-source intelligence by evaluating self-supervised pretraining methods and a mixed pretraining scheme. It compares four SSL algorithms (SimCLR, DINO, MAE, DeepClusterV2) against supervised pretraining on ImageNet variants, using both Vision Transformer (ViT) and ResNet-50 backbones, and introduces a mixed SSL-supervised approach. The authors validate on CrawledFirearmsRGB, a 25k-image, 23-class dataset reflecting real-world web content, finding that SSL pretraining can yield substantial gains, with DINO (ViT) and SimCLR (ResNet-50) often delivering the best downstream accuracy, and that SSL often outperforms large-scale supervised pretraining on ImageNet-1k in this domain. The work demonstrates SSL’s potential to reduce data requirements for domain-specific firearm classification and highlights ViT-specific gains when paired with appropriate SSL pretraining, contributing both methodological insights and a new dataset for OSINT workflows.
Abstract
Automated visual firearms classification from RGB images is an important real-world task with applications in public space security, intelligence gathering and law enforcement investigations. When applied to images massively crawled from the World Wide Web (including social media and dark Web sites), it can serve as an important component of systems that attempt to identify criminal firearms trafficking networks, by analyzing Big Data from open-source intelligence. Deep Neural Networks (DNNs) are the state-of-the-art methodology for achieving this, with Convolutional Neural Networks (CNNs) being typically employed. The common transfer learning approach consists of pretraining on a large-scale, generic annotated dataset for whole-image classification, such as ImageNet-1k, and then finetuning the DNN on a smaller, annotated, task-specific, downstream dataset for visual firearms classification. Neither Vision Transformer (ViT) neural architectures nor Self-Supervised Learning (SSL) approaches have so far been evaluated on this critical task.
