PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation
Ryozo Masukawa, Sanggeon Yun, Yoshiki Yamaguchi, Mohsen Imani
TL;DR
PV-VTT tackles pre-crime privacy-violation detection by constructing a privacy-centric dataset with anonymized text and frame-level feature vectors, enabling safe research in VAR and video captioning. The authors propose a hierarchical GNN framework that builds mission-specific knowledge graphs and integrates them with LLM prompting to generate descriptive captions from a single frame, significantly reducing API-token costs. The approach provides interpretability through explicit reasoning paths and improves cost-efficiency without sacrificing caption quality, as demonstrated on VAR and captioning benchmarks. This work advances privacy-aware surveillance research by coupling a novel dataset with a scalable, interpretable description pipeline.
Abstract
Video crime detection is a significant application of computer vision and artificial intelligence. However, existing datasets primarily focus on detecting severe crimes by analyzing entire video clips, often neglecting the precursor activities (i.e., privacy violations) whose early detection could help prevent these crimes. To address this limitation, we present PV-VTT (Privacy Violation Video To Text), a unique multimodal dataset aimed at identifying privacy violations. PV-VTT provides detailed video and text annotations for each scenario. To protect the privacy of individuals in the videos, we release only video feature vectors and no raw video data. This privacy-focused approach allows researchers to use the dataset while protecting participant confidentiality. Recognizing that privacy violations are often ambiguous and context-dependent, we propose a Graph Neural Network (GNN)-based video description model. Our model generates a GNN-based prompt, paired with an image, for a Large Language Model (LLM), which delivers cost-effective and high-quality video descriptions. By using a single video frame along with relevant text, our method reduces the number of input tokens required, and therefore LLM API usage, while maintaining descriptive quality. Extensive experiments validate the effectiveness and interpretability of our approach in video description tasks and the flexibility of our PV-VTT dataset.
