Transformer-Driven Multimodal Fusion for Explainable Suspiciousness Estimation in Visual Surveillance

Kuldeep Singh Yadav, Lalan Kumar

TL;DR

This work tackles real-time, explainable suspiciousness estimation in surveillance by introducing the USE50k multimodal dataset and the DeepUSEvision framework. The framework integrates a fast Suspicious Object Detector, facial expression and body-language analyzers, and a transformer-based Discriminator Network that fuses their outputs into an interpretable, continuous risk score. The reported experiments show strong detection accuracy, robust cross-dataset generalization, precise behavioral analysis, and a comprehensive explainability pipeline, all at real-time speed. Together, the dataset and framework offer a scalable foundation for proactive security analytics in unconstrained environments.

Abstract

Suspiciousness estimation is critical for proactive threat detection and ensuring public safety in complex environments. This work introduces a large-scale annotated dataset, USE50k, along with a computationally efficient vision-based framework for real-time suspiciousness analysis. The USE50k dataset contains 65,500 images captured from diverse and uncontrolled environments, such as airports, railway stations, restaurants, parks, and other public areas, covering a broad spectrum of cues including weapons, fire, crowd density, abnormal facial expressions, and unusual body postures. Building on this dataset, we present DeepUSEvision, a lightweight and modular system integrating three key components, i.e., a Suspicious Object Detector based on an enhanced YOLOv12 architecture, dual Deep Convolutional Neural Networks (DCNN-I and DCNN-II) for facial expression and body-language recognition using image and landmark features, and a transformer-based Discriminator Network that adaptively fuses multimodal outputs to yield an interpretable suspiciousness score. Extensive experiments confirm the superior accuracy, robustness, and interpretability of the proposed framework compared to state-of-the-art approaches. Collectively, the USE50k dataset and the DeepUSEvision framework establish a strong and scalable foundation for intelligent surveillance and real-time risk assessment in safety-critical applications.
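The abstract describes the fusion step only at a high level, so the following is a minimal sketch of the general idea rather than the authors' Discriminator Network. It assumes each of the three modules (object detector, facial-expression model, body-language model) emits a score in [0, 1] plus a small embedding, and it uses single-query scaled dot-product attention to weight them. The function name `attention_fuse`, the query and key vectors, and all numeric values are hypothetical and chosen purely for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(query, keys, scores):
    """Fuse per-modality scores with single-query scaled dot-product attention.

    query  : list of floats, length d (hypothetical context vector)
    keys   : dict mapping modality name -> list of floats, length d
    scores : dict mapping modality name -> score in [0, 1]

    Returns the fused score and the per-modality attention weights.
    The weights serve as a simple explanation of which cue dominated.
    """
    d = len(query)
    names = sorted(keys)
    logits = [
        sum(q * k for q, k in zip(query, keys[name])) / math.sqrt(d)
        for name in names
    ]
    weights = softmax(logits)
    fused = sum(w * scores[name] for w, name in zip(weights, names))
    return fused, dict(zip(names, weights))


# Illustrative inputs only: these are not values from the paper.
query = [1.0, 0.0]
keys = {"object": [2.0, 0.0], "face": [0.0, 1.0], "body": [0.0, 1.0]}
scores = {"object": 0.9, "face": 0.2, "body": 0.4}

fused_score, weights = attention_fuse(query, keys, scores)
print(f"fused score: {fused_score:.3f}")
for name, weight in weights.items():
    print(f"  {name}: weight {weight:.3f}")
```

Because the weights are a convex combination, the fused score always lies between the lowest and highest modality score, and the weights themselves indicate which cue drove the result. A full transformer would learn the query and key projections and stack several such layers.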

Paper Structure

This paper contains 29 sections, 13 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: Block diagram of the proposed DeepUSEvision system
  • Figure 2: Training behavior and qualitative detection performance of the proposed SOD module: a) Training losses of the SOD across epochs, b) Qualitative bounding-box predictions of the SOD.
  • Figure 3: Training dynamics of the proposed transformer-based Discriminator
  • Figure 4: Statistical Residual Distribution of Risk Score Prediction
  • Figure 5: Ground Truth vs. Predicted Suspiciousness Scores
  • ...and 6 more figures