Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use

Sébastien Thuau; Siba Haidar; Rachid Chelouah

TL;DR

The study tackles privacy and resource constraints in video violence detection by comparing zero-shot VLMs, LoRA-tuned VLMs, and personalized lightweight CNNs under non-IID federated settings. All three approaches achieve over 90% binary accuracy; the 3D CNN delivers superior calibration at roughly half the energy cost of the federated LoRA-tuned VLM, while VLMs offer richer reasoning for complex scenes. Multiclass results on UCF-Crime improve markedly when hierarchical category groupings are applied to reduce semantic confusion, and LoRA adaptation helps VLMs cope with non-IID data. The work quantifies energy use and CO$_2$e emissions to guide hybrid deployment strategies that balance privacy, performance, and sustainability, advocating CNN-first screening with VLM escalation for high-context incidents in real-world surveillance. This framework supports scalable, environmentally conscious decision-making for edge-enabled, privacy-preserving video analytics in the DIVA context.
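The CNN-first screening with VLM escalation described above can be sketched as a simple confidence-based router. This is only an illustrative sketch, not the authors' implementation: the function name `route_clip`, the `Decision` type, the two scoring callables, and the thresholds `low` and `high` are all hypothetical and would need tuning on held-out data.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Decision:
    label: str    # "violent" or "non-violent"
    source: str   # which model produced the label: "cnn" or "vlm"
    score: float  # CNN violence probability for the clip


def route_clip(
    clip: Any,
    cnn_score: Callable[[Any], float],    # hypothetical: returns P(violent) in [0, 1]
    vlm_classify: Callable[[Any], str],   # hypothetical: returns "violent" / "non-violent"
    low: float = 0.2,                     # assumed threshold, not taken from the paper
    high: float = 0.8,                    # assumed threshold, not taken from the paper
) -> Decision:
    """Use the cheap CNN by default; call the VLM only for uncertain clips."""
    p = cnn_score(clip)
    if p >= high:
        return Decision("violent", "cnn", p)
    if p <= low:
        return Decision("non-violent", "cnn", p)
    # The CNN is not confident either way: escalate to the costlier VLM.
    return Decision(vlm_classify(clip), "vlm", p)
```

Widening the gap between `low` and `high` sends more clips to the VLM, trading energy for contextual reasoning; narrowing it keeps more decisions on the CNN.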

Abstract

Deep learning-based video surveillance increasingly demands privacy-preserving architectures with low computational and environmental overhead. Federated learning preserves privacy, but deploying large vision-language models (VLMs) introduces major energy and sustainability challenges. We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets: zero-shot inference with pretrained VLMs, LoRA-based fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 65.8M-parameter 3D CNN. All methods exceed 90% accuracy in binary violence detection. The 3D CNN achieves superior calibration (ROC AUC 92.59%) at roughly half the energy cost of federated LoRA (240 Wh vs. 570 Wh), while VLMs provide richer multimodal reasoning. Hierarchical category grouping (based on semantic similarity and class exclusion) boosts VLM multiclass accuracy from 65.31% to 81% on the UCF-Crime dataset. To our knowledge, this is the first comparative simulation study of LoRA-tuned VLMs and personalized CNNs for federated violence detection with explicit energy and CO$_2$e quantification. Our results inform hybrid deployment strategies that default to efficient CNNs for routine inference and selectively engage VLMs for complex contextual reasoning.
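To make the energy comparison concrete, the short sketch below works through the arithmetic on the two figures quoted in the abstract (240 Wh and 570 Wh) and shows how an energy value maps to CO$_2$e. The helper names and the grid carbon intensity used in the usage note are assumptions for illustration only; the abstract does not state which intensity the authors used.

```python
CNN_WH = 240.0   # 3D CNN energy, as quoted in the abstract
LORA_WH = 570.0  # federated LoRA energy, as quoted in the abstract


def energy_ratio(a_wh: float, b_wh: float) -> float:
    """Fraction of b's energy that a consumes."""
    return a_wh / b_wh


def co2e_grams(energy_wh: float, intensity_g_per_kwh: float) -> float:
    """Convert energy (Wh) to grams of CO2e for a given grid carbon intensity.

    intensity_g_per_kwh is a placeholder input; it depends on the local
    electricity mix and is NOT a value taken from the paper.
    """
    return energy_wh * intensity_g_per_kwh / 1000.0


ratio = energy_ratio(CNN_WH, LORA_WH)  # share of the LoRA energy used by the CNN
saving_wh = LORA_WH - CNN_WH           # absolute difference in Wh
```

With these figures the CNN uses about 42% of the LoRA energy, a difference of 330 Wh. Calling `co2e_grams(CNN_WH, g)` and `co2e_grams(LORA_WH, g)` with the same hypothetical intensity `g` gives emissions in the same 240:570 proportion, since the conversion is linear.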
