PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation

Ryozo Masukawa, Sanggeon Yun, Yoshiki Yamaguchi, Mohsen Imani

TL;DR

PV-VTT tackles pre-crime privacy-violation detection by constructing a privacy-centric dataset with anonymized text and frame-level feature vectors, enabling safe research in VAR and video captioning. The authors propose a hierarchical GNN framework that builds mission-specific knowledge graphs and integrates them with LLM prompting to generate descriptive captions from a single frame, significantly reducing API-token costs. The approach provides interpretability through explicit reasoning paths and improves cost-efficiency without sacrificing caption quality, as demonstrated on VAR and captioning benchmarks. This work advances privacy-aware surveillance research by coupling a novel dataset with a scalable, interpretable description pipeline.

Abstract

Video crime detection is a significant application of computer vision and artificial intelligence. However, existing datasets primarily focus on detecting severe crimes by analyzing entire video clips, often neglecting the precursor activities (i.e., privacy violations) whose detection could help prevent these crimes. To address this limitation, we present PV-VTT (Privacy Violation Video To Text), a multimodal dataset aimed at identifying privacy violations. PV-VTT provides detailed annotations for both the video and the text of each scenario. To protect the privacy of individuals in the videos, we release only video feature vectors and no raw video data. This privacy-focused approach allows researchers to use the dataset while protecting participant confidentiality. Recognizing that privacy violations are often ambiguous and context-dependent, we propose a Graph Neural Network (GNN)-based video description model. Our model generates a GNN-based prompt, paired with an image, for a Large Language Model (LLM), which delivers cost-effective and high-quality video descriptions. By using a single video frame along with relevant text, our method reduces the number of input tokens required, maintaining descriptive quality while reducing LLM API usage. Extensive experiments validate the effectiveness and interpretability of our approach on video description tasks and the flexibility of our PV-VTT dataset.

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview of the privacy video data collection. (a) Frame-level annotation process. (b) Video description generation.
  • Figure 2: Distribution of video length (minutes)
  • Figure 3: Distribution of Privacy Violations by Case
  • Figure 4: (a) An overview of Mission-specific Knowledge Graph Generation and Video Classification. Messages are always passed hierarchically: (1) from sensor data nodes to LLM-generated key concept nodes, (2) from key concept nodes to ConceptNet association nodes, and (3) from association nodes to the final embedding node. (b) Framework for generating an LLM prompt from the pretrained MissionGNN model.
  • Figure 5: Relationship between Quality and API-usage cost
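
The hierarchical message passing described in the Figure 4(a) caption can be illustrated with a toy sketch. This is not the authors' MissionGNN implementation: the node names, the two-dimensional features, and the mean-based update rule are all assumptions chosen for illustration. The sketch only shows the one-directional flow from sensor node to key concept nodes to association nodes to the final embedding node.

```python
from typing import Dict, List

Vector = List[float]


def mean_aggregate(messages: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    dim = len(messages[0])
    return [sum(m[i] for m in messages) / len(messages) for i in range(dim)]


def hierarchical_pass(
    features: Dict[str, Vector],
    levels: List[List[str]],
    parents: Dict[str, List[str]],
) -> Dict[str, Vector]:
    """Propagate features level by level, never backwards.

    `parents` maps each node to its source nodes in the previous level.
    The update rule (average of the node's own feature and the mean of
    its incoming messages) is an illustrative assumption.
    """
    state = dict(features)
    for level in levels[1:]:  # the first level only sends messages
        for node in level:
            incoming = mean_aggregate([state[p] for p in parents[node]])
            state[node] = mean_aggregate([state[node], incoming])
    return state


# Hypothetical toy graph: sensor -> key concepts -> associations -> embedding.
levels = [["sensor"], ["concept_a", "concept_b"], ["assoc_a", "assoc_b"], ["embedding"]]
parents = {
    "concept_a": ["sensor"],
    "concept_b": ["sensor"],
    "assoc_a": ["concept_a"],
    "assoc_b": ["concept_a", "concept_b"],
    "embedding": ["assoc_a", "assoc_b"],
}
features = {
    "sensor": [1.0, 0.0],
    "concept_a": [0.0, 1.0],
    "concept_b": [1.0, 1.0],
    "assoc_a": [0.0, 0.0],
    "assoc_b": [1.0, 0.0],
    "embedding": [0.0, 0.0],
}

result = hierarchical_pass(features, levels, parents)
print(result["embedding"])
```

Because each level reads only from the level before it, the sensor node's feature is never overwritten, and the final embedding depends on every upstream node through a fixed, inspectable path.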