DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE

Yujie Jin, Wenxin Zhang, Jingjing Wang, Guodong Zhou

TL;DR

DeepSVU tackles the limitation of previous SVU approaches by enabling not only threat identification and localization but also causal attribution. It introduces the Unified Physical-world Regularized MoE (UPRM), featuring the Unified Physical-world Enhanced MoE (UPE) block that fuses coarse and fine-grained signals via HPE, ORE, VBE, and CVE, and the Physical-world Trade-off Regularizer (PTR) that mitigates data-imbalance biases among experts. A two-stage training pipeline and two instruction-based datasets (CUVA and UCF-C) support robust physical-world understanding and territorial threat reasoning. Empirical results demonstrate that UPRM outperforms both non-LLM baselines and state-of-the-art Video-LLMs in identifying, locating, and attributing threats, with convergence advantages and interpretable qualitative cases. The work advances practical threat monitoring by delivering precise timestamps and grounded cause explanations, paving the way for richer, context-aware security analytics.

Abstract

Prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing threats (e.g., shootings, robberies) in videos, while largely lacking the capability to generate and evaluate the causes of those threats. Motivated by this gap, this paper introduces a new chat-paradigm SVU task, In-depth Security-oriented Video Understanding (DeepSVU), which aims not only to identify and locate threats but also to attribute and evaluate the causes of threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model coarse-to-fine physical-world information (e.g., human behavior, object interactions, and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components, the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), which address the two challenges respectively. Extensive experiments conducted on our DeepSVU instruction datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results justify the importance of coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of UPRM in capturing such information.
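
To make the idea of weighting several physical-world experts and regularizing their trade-off more concrete, the following is a minimal, generic sketch of softmax-gated mixture-of-experts fusion with a simple balance penalty. It is not the paper's UPE block or PTR: the function names (`moe_combine`, `balance_penalty`), the three toy experts, and the choice of a squared coefficient of variation as the penalty are all assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_combine(expert_outputs, gate_scores):
    """Fuse expert output vectors as a weighted sum using softmax gate weights.

    expert_outputs: one feature vector per expert (hypothetical stand-ins for
    human-behavior, object-interaction, and background-context experts).
    gate_scores: one unnormalized score per expert.
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights


def balance_penalty(batch_weights):
    """Squared coefficient of variation of the mean gate weight per expert.

    A generic balance term (an assumption, not the paper's PTR): it is zero
    when every expert receives the same average weight and grows as the
    gate concentrates on a few experts.
    """
    n_experts = len(batch_weights[0])
    means = [
        sum(w[k] for w in batch_weights) / len(batch_weights)
        for k in range(n_experts)
    ]
    mu = sum(means) / n_experts
    var = sum((m - mu) ** 2 for m in means) / n_experts
    return var / (mu ** 2)


# Toy usage: three experts, two-dimensional features, equal gate scores.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = moe_combine(outputs, [0.0, 0.0, 0.0])
penalty = balance_penalty([weights])
```

In a training loop, a penalty of this kind would typically be added to the task loss with a small coefficient, so that the gate is discouraged from collapsing onto whichever expert the data distribution favors.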

Paper Structure

This paper contains 21 sections, 5 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: (a) The overall architecture of the Unified Physical-world Regularized MoE (UPRM) approach. (b) The Unified Physical-world Enhanced MoE Block, which consists of two main components: Coarse-to-Fine Experts and the Physical-world Trade-off Regularizer. (c) Coarse-grained and Fine-grained Experts, which are used to model the physical-world information.
  • Figure 2: Physical-world information statistics of our dataset.
  • Figure 3: Data composition for training and inference.
  • Figure 4: Comparison of weights assigned by different experts. The "w/o PTR" variant is exactly the basic MoE expert weighting.
  • Figure 5: Convergence analysis of UPRM and other Video-LLMs on (a) CUVA and (b) UCF-C instruction datasets.
  • ...and 4 more figures