VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation

Hao Wang, Jiayou Qin, Ashish Bastola, Xiwen Chen, John Suchanek, Zihao Gong, Abolfazl Razi

TL;DR

This paper explores the performance contribution of different prompt components, outlines a vision for future improvement in visual accessibility, and paves the way for LLMs in video anomaly detection and vision-language understanding.

Abstract

This paper explores the potential of Large Language Models (LLMs) in zero-shot anomaly detection for safe visual navigation. With the assistance of the state-of-the-art real-time open-world object detection model YOLO-World and specialized prompts, the proposed framework can identify anomalies within camera-captured frames that include any possible obstacles, then generate concise, audio-delivered descriptions that emphasize abnormalities to assist in safe visual navigation in complex circumstances. Moreover, the proposed framework leverages the advantages of LLMs and the open-vocabulary object detection model to achieve dynamic scenario switching, which allows users to transition smoothly from scene to scene and addresses a limitation of traditional visual navigation. Furthermore, this paper explores the performance contribution of different prompt components, provides a vision for future improvement in visual accessibility, and paves the way for LLMs in video anomaly detection and vision-language understanding.

Paper Structure (23 sections, 5 figures, 5 tables)

This paper contains 23 sections, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Framework for vision-language processing and prompting.
  • Figure 2: Type H image splitter. (1) and (2) represent the left and right areas, (3) represents the ground area, and (4) represents the front area.
  • Figure 3: ROC curve.
  • Figure 4: Confusion matrix over all frames. The LLM uses the low system-sensitivity setting.
  • Figure 5: Anomaly annotation. The first row shows the anomalies labeled by the rule-based detector (binary), and the second row shows the anomalies predicted by the proposed LLM detector (float). Color represents the probability of an anomaly.
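
The Type H image splitter in Figure 2 can be illustrated with a minimal sketch that maps a detection's bounding-box center to one of the four areas. The geometry here is an assumption, not taken from the paper: the left and right areas are taken to be vertical strips, and the middle column is taken to be divided into a front area (upper part) and a ground area (lower part). The names `HSplitter`, `side_frac`, `ground_frac`, and `box_region`, and the default fractions, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSplitter:
    """Hypothetical H-shaped frame splitter (layout and fractions are assumptions)."""

    width: int                 # frame width in pixels
    height: int                # frame height in pixels
    side_frac: float = 0.25    # assumed share of the width for each side strip
    ground_frac: float = 0.5   # assumed share of the height for the ground area

    def region(self, x: float, y: float) -> str:
        """Return the area name for a pixel coordinate (origin at top-left)."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise ValueError("point lies outside the frame")
        side = self.width * self.side_frac
        if x < side:
            return "left"
        if x >= self.width - side:
            return "right"
        # Middle column: the lower part is ground, the upper part is front.
        ground_top = self.height * (1.0 - self.ground_frac)
        return "ground" if y >= ground_top else "front"


def box_region(splitter: HSplitter, box: tuple) -> str:
    """Assign a bounding box (x1, y1, x2, y2) to an area by its center point."""
    x1, y1, x2, y2 = box
    return splitter.region((x1 + x2) / 2, (y1 + y2) / 2)


if __name__ == "__main__":
    splitter = HSplitter(width=640, height=480)
    print(box_region(splitter, (300, 300, 340, 460)))
```

Assigning by box center is only one possible choice; a box that spans several areas could instead be assigned to every area it overlaps.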