Video Summarisation with Incident and Context Information using Generative AI

Ulindu De Silva, Leon Fernando, Kalinga Bandara, Rashmika Nawaratne

TL;DR

This work tackles the challenge of efficiently analyzing large volumes of surveillance video by replacing generic summaries with user-tailored textual narratives generated via Generative AI. The authors integrate YOLO-V8 for object detection with Gemini Pro Vision/Gemini Pro for contextual analysis, guided by customizable prompts to identify and summarize incidents. They introduce a practical pipeline and validate it on MSR-VTT and CCTV footage, reporting a 72.8% similarity to ground truth in the quantitative evaluation and an accuracy of 85% in the qualitative assessment. The approach promises faster, more targeted video review for security and operations, with potential for on-device deployment and domain-specific fine-tuning.

Abstract

The proliferation of video content production has led to vast amounts of data, posing substantial challenges in terms of analysis efficiency and resource utilization. Addressing this issue calls for the development of robust video analysis tools. This paper proposes a novel approach leveraging Generative Artificial Intelligence (GenAI) to facilitate streamlined video analysis. Our tool aims to deliver tailored textual summaries of user-defined queries, offering a focused insight amidst extensive video datasets. Unlike conventional frameworks that offer generic summaries or limited action recognition, our method harnesses the power of GenAI to distil relevant information, enhancing analysis precision and efficiency. Employing YOLO-V8 for object detection and Gemini for comprehensive video and text analysis, our solution achieves heightened contextual accuracy. By combining YOLO with Gemini, our approach furnishes textual summaries extracted from extensive CCTV footage, enabling users to swiftly navigate and verify pertinent events without the need for exhaustive manual review. The quantitative evaluation revealed a similarity of 72.8%, while the qualitative assessment rated an accuracy of 85%, demonstrating the capability of the proposed method.
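
The detect-then-summarise flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Detection`, `select_frames`, `build_prompt`, and `summarise_video` are hypothetical, the detector is a plain callable standing in for YOLO-V8, the summariser is a plain callable standing in for Gemini, and the prompt wording is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass(frozen=True)
class Detection:
    """One detected object in one frame (hypothetical structure)."""
    frame_index: int
    label: str
    confidence: float


# Stand-ins for the real models (assumptions, not real library APIs):
# a detector maps a frame index to detections, a summariser maps a prompt to text.
Detector = Callable[[int], List[Detection]]
Summariser = Callable[[str], str]


def select_frames(detections: Iterable[Detection],
                  labels_of_interest: Iterable[str],
                  min_confidence: float = 0.5) -> List[int]:
    """Return sorted indices of frames containing a confident object of interest."""
    wanted = {label.lower() for label in labels_of_interest}
    return sorted({
        d.frame_index for d in detections
        if d.label.lower() in wanted and d.confidence >= min_confidence
    })


def build_prompt(user_query: str,
                 detections: Sequence[Detection],
                 frames: Sequence[int]) -> str:
    """Combine the user-defined query with detections from the selected frames."""
    keep = set(frames)
    lines = [f"frame {d.frame_index}: {d.label} ({d.confidence:.2f})"
             for d in detections if d.frame_index in keep]
    evidence = "\n".join(lines) if lines else "no relevant objects detected"
    return (f"User query: {user_query}\n"
            f"Detected objects:\n{evidence}\n"
            "Summarise only the incidents relevant to the query.")


def summarise_video(frame_count: int,
                    detector: Detector,
                    summariser: Summariser,
                    user_query: str,
                    labels_of_interest: Iterable[str]) -> str:
    """Run detection on every frame, filter frames, then ask for a tailored summary."""
    detections = [d for i in range(frame_count) for d in detector(i)]
    frames = select_frames(detections, labels_of_interest)
    return summariser(build_prompt(user_query, detections, frames))


def demo_detector(frame_index: int) -> List[Detection]:
    """Toy detector: cars on even frames, trees on odd frames."""
    label = "car" if frame_index % 2 == 0 else "tree"
    confidence = 0.9 if label == "car" else 0.8
    return [Detection(frame_index, label, confidence)]


def echo_summariser(prompt: str) -> str:
    """Toy summariser that returns the prompt, so the data flow is visible."""
    return prompt


if __name__ == "__main__":
    print(summarise_video(4, demo_detector, echo_summariser,
                          "Report any vehicles entering the scene", ["car"]))
```

In a real deployment the two toy callables would presumably be replaced by wrappers around an object detector and a multimodal model. Keeping them as injected callables means the filtering and prompt-building logic can be checked without either model being available.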

Paper Structure

This paper contains 18 sections, 1 equation, 3 figures, 2 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Proposed Method Overview
  • Figure 2: Model output for a video from the MSR-VTT dataset (video ID: 1360). The model consistently generates high-quality output, although it incorrectly assumes the comedian uses a guitar in the performance when it is merely part of the background. Inferring the comedian's talent solely from this context is also speculative, showing that the model occasionally goes beyond what the footage factually supports.
  • Figure 3: CCTV traffic video