Short-Window Sliding Learning for Real-Time Violence Detection via LLM-based Auto-Labeling

Seoik Jung, Taekyung Song, Yangro Lee, Sungjun Lee

TL;DR

Problem: Real-time violence detection in CCTV footage requires high temporal fidelity and low computational load. Approach: a short-window sliding learning pipeline divides long videos into 1–2 s clips, auto-labels them with LLM captions, refines the labels through human review, and trains a vision-language transformer (InternVL3) on all frames of each clip. Data construction uses a sliding window with parameters $L$, $W$, and $S$, producing $N = \frac{L-W}{S} + 1$ clips per video to minimize information loss. Findings: on RWF-2000 the method achieves $95.25\%$ accuracy; on long UCF-Crime videos, accuracy rises from $55.75\%$ to $83.25\%$ when short clips are adopted; combining short-domain data yields $95.25\%$, indicating strong cross-domain generalization. Significance: the approach enables real-time, data-efficient violence detection, with potential for multimodal extension to broader surveillance tasks.

Abstract

This paper proposes a Short-Window Sliding Learning framework for real-time violence detection in CCTV footage. Unlike conventional long-video training approaches, the proposed method divides videos into 1–2 second clips and applies Large Language Model (LLM)-based auto-caption labeling to construct fine-grained datasets. Each short clip uses all of its frames to preserve temporal continuity, enabling precise recognition of rapid violent events. Experiments show that the proposed method achieves 95.25% accuracy on RWF-2000 and significantly improves performance on long videos (UCF-Crime: 83.25%), supporting its strong generalization and real-time applicability in intelligent surveillance systems.
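
The clip count $N = \frac{L-W}{S} + 1$ given in the TL;DR can be illustrated with a minimal sketch. It assumes that $L$ is the video length, $W$ the window length, and $S$ the stride, all in the same integer unit (seconds or frames); the summary does not define these symbols explicitly. The function name `sliding_window_clips` is hypothetical, and flooring the division when $L - W$ is not a multiple of $S$ is also an assumption, not something the paper states.

```python
def sliding_window_clips(L: int, W: int, S: int) -> list[tuple[int, int]]:
    """Return (start, end) offsets of the clips cut from a video.

    Assumed meaning of the symbols (not confirmed by the source):
      L: total video length, W: window (clip) length, S: stride.
    All three share one integer unit, e.g. seconds or frames.
    """
    if W <= 0 or S <= 0:
        raise ValueError("W and S must be positive")
    if L < W:
        # The video is shorter than one window, so no full clip fits.
        return []
    # N = (L - W) / S + 1, floored when (L - W) is not a multiple of S.
    n_clips = (L - W) // S + 1
    return [(i * S, i * S + W) for i in range(n_clips)]


if __name__ == "__main__":
    # A 10 s video, 2 s windows, 1 s stride (overlapping clips).
    clips = sliding_window_clips(L=10, W=2, S=1)
    print(len(clips), clips[0], clips[-1])
```

With a stride smaller than the window, consecutive clips overlap, so an event that straddles a clip boundary still appears whole in at least one clip; with $S = W$ the clips tile the video without overlap.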

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Comparison of frame sampling methods.
  • Figure 2: Proposed data construction process. Step 1 involves dividing the original video into 1-2 second clips and generating automatic captions. Step 2 builds a refined short video dataset through LLM-based auto-labeling and human review.