A Benchmark for Crime Surveillance Video Analysis with Large Models

Haoran Chen, Dong Yi, Moyan Cao, Chensen Huang, Guibo Zhu, Jinqiao Wang

TL;DR

UCVL is a multi-task benchmark for crime surveillance video analysis. It merges UCF-Crime labels with UCF-Crime Annotations into a unified QA framework covering six QA types. QA content is produced by Qwen2-72B, and open-ended responses are scored by GPT-4o using a detailed rubric. Eight MLLMs are benchmarked and two models are finetuned (LLaVA-UCVL). Performance generally improves with model size, but the models show notable anomaly-blindness; finetuning the 7B model yields substantial gains, illustrating the benefit of domain adaptation. Component scores are aggregated into a weighted total, $Total = 0.15 \times S_{TF} + 0.1 \times S_{AC} + 0.15 \times S_{ED} + 0.15 \times S_{AD} + 0.2 \times S_{TG} + 0.25 \times S_{MCQ}$, which gives a single overall score per model.
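The weighted total in the TL;DR can be illustrated with a minimal sketch. The weights and subscripts (TF, AC, ED, AD, TG, MCQ) are taken from the formula; the function name `weighted_total`, the assumption that all six component scores share one scale, and the example scores are hypothetical and are not results from the paper.

```python
# Weights copied from the TL;DR formula; keys are the subscripts used there.
WEIGHTS = {
    "TF": 0.15,
    "AC": 0.10,
    "ED": 0.15,
    "AD": 0.15,
    "TG": 0.20,
    "MCQ": 0.25,
}


def weighted_total(scores):
    """Combine six component scores into one total.

    Assumption: every component score is on the same scale (for example 0-100).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up component scores, for illustration only.
example = {"TF": 80.0, "AC": 60.0, "ED": 50.0, "AD": 40.0, "TG": 30.0, "MCQ": 70.0}
total = weighted_total(example)
print(round(total, 2))  # 55.0 for these made-up scores
```

Because the six weights sum to 1, the total stays on the same scale as the component scores, and MCQ (weight 0.25) has the largest influence on the total while AC (weight 0.1) has the smallest.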

Abstract

Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. Although MLLMs are particularly versatile, their ability to understand anomalous concepts and details remains insufficiently studied, because existing benchmarks in this field are outdated: they provide neither MLLM-style QA pairs nor efficient methods for assessing a model's open-ended text responses. To fill this gap, we propose UCVL, a benchmark for crime surveillance video analysis with large models, which includes 1,829 videos and reorganized annotations from the UCF-Crime and UCF-Crime Annotation datasets. We design six types of questions and generate diverse QA pairs. We then develop detailed instructions and use OpenAI's GPT-4o for accurate assessment. We benchmark eight prevailing MLLMs ranging from 0.5B to 40B parameters, and the results demonstrate the reliability of the benchmark. Moreover, we finetune LLaVA-OneVision on UCVL's training set. The resulting improvement validates the high quality of our data for video anomaly analysis.

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A comparison of different models' performance on an "Assault" video from UCVL. The green lines highlight the correct answers and descriptions, while the red lines indicate wrong answers and descriptions. See more cases in Appendix A.
  • Figure 2: The pipeline of this work. We first parse the data sources and design the task types. Then we use an LLM to generate video summaries and QA pairs. Finally, we finetune two models and evaluate ten models on UCVL.
  • Figure 3: Heatmap of models' performance across 14 crime categories.