Designing Multi-Robot Ground Video Sensemaking with Public Safety Professionals

Puqi Zhou, Ali Asgarov, Aafiya Hussain, Wonjoon Park, Amit Paudyal, Sameep Shrestha, Chia-wei Tang, Michael F. Lighthiser, Michael R. Hieb, Xuesu Xiao, Chris Thomas, Sungsoo Ray Hong

TL;DR

The paper introduces MRVS, a human–AI system for multi-robot ground video sensemaking designed with public safety professionals. It presents a testbed with 38 Events of Interest, a 20-video dataset, and six design requirements, then implements MRVS with a multimodal backend and interactive frontend evaluated through algorithmic benchmarks and expert interviews. The results show MRVS can increase recall and overall usefulness for public-safety workflows, while highlighting concerns about false alarms, privacy, and governance. The study argues that configurable, explainable AI coupled with collaboration-centric interfaces can meaningfully scale situational awareness in resource-constrained policing environments, with implications for broader adoption and responsible deployment.

Abstract

Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals' burden. Yet little is known about how to design and integrate multi-robot videos into public safety workflows. Collaborating with six police agencies, we examined how such videos could be made practical. In Study 1, we presented the first testbed for multi-robot ground video sensemaking. The testbed includes 38 events-of-interest (EoI) relevant to public safety, a dataset of 20 robot patrol videos (10 day/night pairs) covering EoI types, and 6 design requirements aimed at improving current video sensemaking practices. In Study 2, we built MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. Participants reported reduced manual workload and greater confidence with LLM-based explanations, while noting concerns about false alarms and privacy. We conclude with implications for designing future multi-robot video sensemaking tools.

Paper Structure (56 sections, 7 figures, 4 tables)

This paper contains 56 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Research flow: (1) Identify 38 EoIs and 6 DRs from records (13,234 crime records from three US campuses and 10 research anomaly video datasets [ma2015anomaly, ramachandra2020street, qian2025ucf, wu2020not, wang2020nwpu, zhang2016single, acsintoae2022ubnormal, pranav2020day, abnormal2013lu, rodrigues2020multi]) through survey reviews and interviews with five public safety professionals; (2) Create a 20-video multi-robot testbed simulating these EoIs; (3) Build the MRVS system with a front-end interface and a multimodal LLM back-end; (4) Evaluate via benchmarking and expert interviews with nine professionals.
  • Figure 2: Examples of different anomalies in our testbed shown in sequences. Each second column is manually zoomed in.
  • Figure 3: MRVS interface layout and corresponding design requirements.
  • Figure 4: F1. Browsing and investigating detected EoIs for situational awareness. Left: Professionals begin with a robot-level video debrief (a), where detected events are grouped by priority and sorted by time/urgency. Selecting an event opens an inspectable card (b) with triage actions (save/share) and a representative keyframe, along with model confidence and rationale (c). Right: The situational overview summarizes events across robots (d) and supports keyword search (e) and filtering by shift, priority, and event type (f). Filtered events are presented as cards for rapid scanning (g); each card offers mark-reviewed, save, and share actions and a quick preview of the event video segment, and clicking “Check Details” links the card to deeper inspection.
  • Figure 5: F2. Reasoning across time and space. Professionals adjust the day/night time window for videos (a) and browse multiple robots via the video list, with toggles to show or hide the videos linked to the timeline and trajectory (b). Three layout options display the map, video debrief, and video (c). The legend supports filtering by event type and entity (d). Robot trajectories can be selected, with icons linking to keyframe popups and a "Check Details" link for deeper inspection (e). Users can quickly preview the corresponding video segments (f). A global timeline serves as a shared temporal reference (g), while per-robot timelines with synchronized playheads enable aligned cross-robot comparison and time-jump navigation for spatiotemporal reasoning.
  • ...and 2 more figures