Large Language Models for Video Surveillance Applications

Ulindu De Silva, Leon Fernando, Billy Lau Pik Lik, Zann Koh, Sam Conrad Joyce, Belinda Yuen, Chau Yuen

TL;DR

This paper tackles the challenge of efficiently analyzing vast CCTV video data by generating tailored textual summaries using vision-language GenAI. It proposes a per-camera, frame-to-summary pipeline leveraging Gemini Pro Vision, with cross-camera fusion to create network-wide descriptions in response to user queries. Experimental evaluation on MSR-VTT and custom CCTV datasets demonstrates promising temporal and spatial summarization capabilities, supported by qualitative assessments. The findings suggest substantial gains in analysis efficiency and storage reduction, while noting the need for onboard deployment and future dataset development for rigorous evaluation.
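The per-camera, frame-to-summary flow with cross-camera fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the frame-sampling step, and the `describe` and `summarize` callables are all hypothetical placeholders. In a real system a vision-language model such as Gemini Pro Vision would sit behind those callables; here they are toy stubs so the example runs on its own.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-ins (not from the paper): a frame is represented by a
# string ID, and the model is any callable that turns inputs into text.
Frame = str
DescribeFn = Callable[[Frame, str], str]           # (frame, query) -> description
SummarizeFn = Callable[[Sequence[str], str], str]  # (texts, query) -> summary


def sample_frames(frames: Sequence[Frame], step: int) -> List[Frame]:
    """Keep every `step`-th frame to limit the number of model calls."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return list(frames[::step])


def summarize_camera(frames: Sequence[Frame], query: str,
                     describe: DescribeFn, summarize: SummarizeFn,
                     step: int = 2) -> str:
    """Per-camera stage: describe sampled frames, then condense them."""
    descriptions = [describe(f, query) for f in sample_frames(frames, step)]
    return summarize(descriptions, query)


def summarize_network(cameras: Dict[str, Sequence[Frame]], query: str,
                      describe: DescribeFn, summarize: SummarizeFn,
                      step: int = 2) -> str:
    """Cross-camera stage: fuse per-camera summaries into one description."""
    per_camera = [
        f"{cam}: {summarize_camera(frames, query, describe, summarize, step)}"
        for cam, frames in sorted(cameras.items())  # sorted for stable output
    ]
    return summarize(per_camera, query)


# Toy stubs standing in for vision-language model calls (assumptions).
def fake_describe(frame: Frame, query: str) -> str:
    return f"{frame} ({query})"


def fake_summarize(texts: Sequence[str], query: str) -> str:
    return " | ".join(texts)


if __name__ == "__main__":
    footage = {"cam_b": ["b0", "b1", "b2"], "cam_a": ["a0", "a1"]}
    print(summarize_network(footage, "people", fake_describe, fake_summarize))
```

Passing the model calls in as plain callables keeps the two stages (per-camera summary, then network-wide fusion) independent of any particular model API, so the stubs can be swapped for real calls without changing the pipeline structure.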

Abstract

The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. Addressing these challenges requires robust video analysis tools. This paper presents an innovative proof of concept that uses Generative Artificial Intelligence (GenAI), in the form of Vision Language Models, to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach uses Vision Language Models to extract the information relevant to each query, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage; these summaries can be stored indefinitely in far less storage space than the videos themselves, allowing users to quickly navigate and verify significant events without exhaustive manual review. In qualitative evaluations, the pipeline achieves 80% accuracy in temporal and 70% accuracy in spatial quality and consistency.

Paper Structure (6 sections, 3 figures)

This paper contains 6 sections, 3 figures.

Figures (3)

  • Figure 1: Proposed Method Overview
  • Figure 2: CCTV Network
  • Figure 3: Temporal and spatial analysis of two sample videos taken in a room from two different viewpoints, one after the other. Light green highlights mark occurrences where our method detected information such as people, interactions, and environmental details, which help create a vivid picture of the scene.