ACT360: An Efficient 360-Degree Action Detection and Summarization Framework for Mission-Critical Training and Debriefing

Aditi Tiwari, Klara Nahrstedt

TL;DR

ACT360 serves as a generalized framework for mission-critical debriefing, incorporating techniques such as EAC, spatial attention, summarization, and model optimization, which apply to any training environment requiring lightweight action detection and structured post-exercise analysis.

Abstract

Effective training and debriefing are critical in high-stakes, mission-critical environments such as disaster response, military simulations, and industrial safety, where precision and minimizing errors are paramount. Traditional post-training analysis relies on manually reviewing 2D videos, a time-consuming process that lacks comprehensive situational awareness. To address these limitations, we introduce ACT360, a system that leverages 360-degree videos and machine learning for automated action detection and structured debriefing. ACT360 integrates 360YOWO, an enhanced You Only Watch Once (YOWO) model with spatial attention and equirectangular-aware convolution (EAC) to mitigate panoramic video distortions. To enable deployment in resource-constrained environments, we apply quantization and model pruning, reducing the model size by 74% while maintaining robust accuracy (mAP drop of only 1.5 percentage points, from 0.865 to 0.850) and improving inference speed. We validate our approach on a publicly available dataset of 55 labeled 360-degree videos covering seven key operational actions, recorded across various real-world training sessions and environmental conditions. Additionally, ACT360 integrates 360AIE (Action Insight Explorer), a web-based interface for automatic action detection, retrieval, and textual summarization using large language models (LLMs), significantly enhancing post-incident analysis efficiency. ACT360 serves as a generalized framework for mission-critical debriefing, incorporating EAC, spatial attention, summarization, and model optimization. These innovations apply to any training environment requiring lightweight action detection and structured post-exercise analysis.
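
The abstract names quantization and pruning without detailing them here. As a rough illustration of what those two steps do to a weight tensor, the following is a minimal sketch, assuming plain magnitude pruning and symmetric per-tensor INT8 quantization. It is not the authors' pipeline: the function names, the 50% sparsity level, and the toy weight list are all hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to zero
    if k == 0:
        return list(weights)
    # Indices sorted by ascending |w|; the first k are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical toy weights, for illustration only.
weights = [0.82, -0.03, 0.41, 0.002, -0.67, 0.09, -0.15, 0.55]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
```

Each restored value differs from its pruned original by at most half a quantization step (`scale / 2`), which is why accuracy can degrade only slightly when weights are stored as 8-bit integers instead of 32-bit floats.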

Paper Structure

This paper contains 24 sections, 1 equation, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: 360-degree video frame showing ERP distortion near poles (orange box), demonstrating the need for ERP-aware processing.
  • Figure 2: ACT360 framework overview illustrating three main stages: (1) Data Preprocessing, (2) Model Training, and (3) Inference.
  • Figure 3: 360YOWO architecture and optimization pipeline. Top: Effects of quantization (FP32 to INT8) and pruning (QP) on individual components. Bottom: Dual-stream architecture integrating spatiotemporal (3D CNN) and spatial (2D CNN + EAC) processing, refined through CFAM-QP for final action detection.
  • Figure 4: 360AIE interface displaying key components, including video selection, detection overlays, zoomed-in action views, an action timeline, and text-based summaries.
  • Figure 5: Results of multi-user evaluation where six concurrent requests for different actions were sent to 360AIE. The plot shows the inference times for processing those requests for both short videos (less than six minutes) and long videos (more than ten minutes), highlighting the impact of video length on processing time.
  • ...and 3 more figures
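
The pole distortion mentioned in the Figure 1 caption follows from the geometry of the equirectangular projection (ERP): every image row spans the full 360 degrees of longitude, so a row at latitude φ is stretched horizontally by a factor of 1/cos(φ) relative to the equator. The sketch below computes that factor per pixel row. The function names and the 960-row image height are assumptions made for illustration and do not come from the paper.

```python
import math


def erp_row_latitude(row, height):
    """Latitude in radians at the centre of pixel row `row` (row 0 is the top)."""
    return (0.5 - (row + 0.5) / height) * math.pi


def erp_horizontal_stretch(row, height):
    """Horizontal stretch of an ERP row relative to the equator: 1 / cos(latitude)."""
    return 1.0 / math.cos(erp_row_latitude(row, height))


# Hypothetical 960-row frame: compare the equator with rows nearer the top pole.
height = 960
equator = erp_horizontal_stretch(height // 2, height)
mid_latitude = erp_horizontal_stretch(160, height)
near_pole = erp_horizontal_stretch(0, height)
```

The factor stays close to 1 around the equator, reaches about 2 at 60 degrees of latitude, and grows without bound toward the poles, so a fixed-size convolution kernel covers very different amounts of the scene depending on the row it is applied to.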