ForeSea: AI Forensic Search with Multi-modal Queries for Video Surveillance

Hyojin Park, Yi Li, Janghoon Cho, Sungha Choi, Jungsoo Lee, Taotao Jing, Shuai Zhang, Munawar Hayat, Dashan Gao, Ning Bi, Fatih Porikli

Abstract

Despite decades of work, surveillance still struggles to find specific targets across long, multi-camera video. Prior methods (tracking pipelines, CLIP-based models, and VideoRAG) require heavy manual filtering, capture only shallow attributes, and fail at temporal reasoning. Real-world searches are inherently multimodal (e.g., "When does this person join the fight?" with the person's image), yet this setting remains underexplored. Moreover, no suitable benchmark exists for evaluating this setting, in which a video is queried with both an image and text. To address this gap, we introduce ForeSeaQA, a new benchmark specifically designed for video QA with image-and-text queries and timestamped annotations of key events. The dataset consists of long-horizon surveillance footage paired with diverse multimodal questions, enabling systematic evaluation of retrieval, temporal grounding, and multimodal reasoning in realistic forensic conditions. Beyond the benchmark, we propose ForeSea, an AI forensic search system with a 3-stage, plug-and-play pipeline: (1) a tracking module filters irrelevant footage; (2) a multimodal embedding module indexes the remaining clips; and (3) during inference, the system retrieves top-K candidate clips for a Video Large Language Model (VideoLLM) to answer queries and localize events. On ForeSeaQA, ForeSea improves accuracy by 3.5% and temporal IoU by 11.0 over prior VideoRAG models. To our knowledge, ForeSeaQA is the first benchmark to support complex multimodal queries with precise temporal grounding, and ForeSea is the first VideoRAG system built to excel in this setting.
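
The abstract reports gains in temporal IoU but this excerpt does not give the metric's definition. As a minimal sketch, assuming the standard intersection-over-union of a predicted and a ground-truth time interval (the function name `temporal_iou` and the `(start, end)` tuple format in seconds are illustrative choices, not taken from the paper), the computation could look like this:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals.

    Each interval is a (start, end) tuple in seconds. This is the usual
    interval IoU; the paper's exact definition may differ (assumption).
    """
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


# Predicted event at 10-30 s, annotated event at 20-40 s:
# overlap is 10 s, union is 30 s, so the IoU is 1/3.
score = temporal_iou((10.0, 30.0), (20.0, 40.0))
```

A score of 1.0 means the predicted interval matches the annotation exactly, and 0.0 means the two intervals do not overlap.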

Paper Structure (32 sections, 3 equations, 6 figures, 7 tables)

This paper contains 32 sections, 3 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: AI Forensic Search with ForeSea. Our proposed framework for long surveillance videos supports complex multimodal queries (e.g., a reference image combined with a text question) and leverages a person-centric multimodal database to efficiently retrieve and generate temporally grounded answers.
  • Figure 2: ForeSeaQA Data Engine. We use text-only and multimodal LLMs to extract person entities from dense video captions, visually ground each entity to create query image crops, and generate multimodal QA pairs with timestamps. All generated QA samples and query images are reviewed by human workers for correctness.
  • Figure 3: Statistics of the ForeSeaQA benchmark. (a) Task distribution by question. (b) Relative start position of ground-truth time ranges. (c) Statistics of video duration. (d) Comparison of benchmarks. Tasks: MC = multiple-choice, OE = open-ended, TG = temporal grounding, STG = spatiotemporal grounding. T_ann = temporal annotation, MMq = multimodal query.
  • Figure 4: Overview of the ForeSea pipeline. ForeSea consists of two main components: (1) Video Database Construction, in which a multimodal encoder embeds short video clips from the human tracking module and pairs them with metadata; and (2) Query Answering, which retrieves candidate videos from the database using a multimodal query and generates answers based on the retrieved content.
  • Figure 5: The multimodal encoder produces (a) a video embedding from multiple frames and (b) a query embedding from text or image-text inputs.
  • ...and 1 more figure
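
The retrieval step described in the captions for Figures 4 and 5 (clip embeddings stored with metadata, then queried with a multimodal embedding to obtain top-K candidates) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the excerpt does not state the similarity measure (cosine similarity is used below), the metadata fields (`camera`, `start`, `end`), or the embedding dimensionality, and the toy vectors stand in for real encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_emb, database, k=2):
    """Return the k (score, metadata) pairs most similar to the query.

    `database` is a list of (embedding, metadata) pairs, a hypothetical
    stand-in for the person-centric clip database.
    """
    scored = [(cosine(query_emb, emb), meta) for emb, meta in database]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy database: three clip embeddings with hypothetical metadata.
database = [
    ([1.0, 0.0, 0.0], {"camera": "cam1", "start": 0.0, "end": 10.0}),
    ([0.0, 1.0, 0.0], {"camera": "cam1", "start": 10.0, "end": 20.0}),
    ([0.7, 0.7, 0.0], {"camera": "cam2", "start": 5.0, "end": 15.0}),
]

# Toy query embedding, standing in for an encoded image-plus-text query.
query = [0.9, 0.1, 0.0]
top = retrieve_top_k(query, database, k=2)
```

In a pipeline of this shape, the retrieved clips and their timestamps would then be passed to the VideoLLM, which produces the answer and localizes the event.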