PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance

Jiaxu Leng, Zhanjie Wu, Mingpi Tan, Mengjingcheng Mo, Jiankang Zheng, Qingqing Li, Ji Gan, Xinbo Gao

TL;DR

PiercingEye is a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries to enhance discriminative feature representation and achieves state-of-the-art performance on XD-Violence and UCF-Crime benchmarks.

Abstract

Existing weakly supervised video violence detection (VVD) methods primarily rely on Euclidean representation learning, which often struggles to distinguish visually similar yet semantically distinct events due to limited hierarchical modeling and insufficient ambiguous training samples. To address this challenge, we propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries to enhance discriminative feature representation. Specifically, PiercingEye introduces a layer-sensitive hyperbolic aggregation strategy with hyperbolic Dirichlet energy constraints to progressively model event hierarchies, and a cross-space attention mechanism to facilitate complementary feature interactions between Euclidean and hyperbolic spaces. Furthermore, to mitigate the scarcity of ambiguous samples, we leverage large language models to generate logic-guided ambiguous event descriptions, enabling explicit supervision through a hyperbolic vision-language contrastive loss that prioritizes high-confusion samples via dynamic similarity-aware weighting. Extensive experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance, with particularly strong results on a newly curated ambiguous event subset, validating its superior capability in fine-grained violence detection.

Paper Structure

This paper contains 31 sections, 1 theorem, 30 equations, 12 figures, 8 tables.

Key Result

Theorem 1

$\forall \textbf{x} \in \mathbb{L}_{K}^{n}$, $\textbf{M}\in \mathbb{R}^{(m+1)\times(n+1)}$, we have $f_{x}(\textbf{M})\textbf{x} \in \mathbb{L}_{K}^{m}$.
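
The theorem can be checked numerically with a minimal sketch. This summary does not define $f_{x}(\textbf{M})$, so the code assumes the common Lorentz linear layer construction, in which the first row of $\textbf{M}$ is rescaled so that the output's time coordinate equals $\sqrt{\lVert \textbf{W}\textbf{x} \rVert^{2} - 1/K}$, where $\textbf{W}$ denotes the remaining $m$ rows and $K < 0$ is the curvature. The function names and the hyperboloid convention $\langle \textbf{x}, \textbf{x} \rangle_{\mathcal{L}} = 1/K$ are illustrative assumptions, not taken from the paper.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift_to_hyperboloid(space, K=-1.0):
    """Prepend the time coordinate so that <x, x>_L = 1/K (assumes K < 0)."""
    time = math.sqrt(sum(s * s for s in space) - 1.0 / K)
    return [time] + list(space)

def lorentz_linear(M, x, K=-1.0):
    """Assumed form of f_x(M) x.

    Rows 1..m of M (called W here) act as an ordinary linear map on x.
    Under the assumed construction the first row of M only enters through
    a rescaling that cancels in the product, so it does not affect the
    result; the time coordinate is recomputed from W x.
    """
    W = M[1:]
    space = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return lift_to_hyperboloid(space, K)

K = -1.0
x = lift_to_hyperboloid([0.3, -1.2, 0.5], K)   # point on L_K^n with n = 3
M = [[0.1, 0.2, 0.3, 0.4],                     # (m+1) x (n+1) matrix, m = 2
     [1.0, -0.5, 0.2, 0.0],
     [0.3, 0.8, -0.1, 0.7]]
y = lorentz_linear(M, x, K)

# The output should satisfy the hyperboloid constraint of L_K^m.
residual = lorentz_inner(y, y) - 1.0 / K
print(len(y), abs(residual) < 1e-9)
```

Under these assumptions the output has $m+1$ coordinates and satisfies the hyperboloid constraint up to floating-point error, which is the closure property the theorem states.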

Figures (12)

  • Figure 1: Overview of the core idea behind the proposed PiercingEye framework. (a) The hierarchical structure of event categories in VVD, where ambiguous events, such as fighting (violent) and playing hockey (normal), are visually similar but semantically distinct, making them difficult to distinguish using conventional methods. (b) The temporal hierarchy of event development, showing the semantic progression before, during, and after an event, which provides contextual cues to mitigate ambiguity. (c) Modeling in a single space, whether Euclidean or hyperbolic, struggles to simultaneously capture visual features and hierarchical event relations, often leading to incorrect predictions for ambiguous events. (d) Our PiercingEye adopts a dual-space strategy: Euclidean space captures visual features, hyperbolic space models hierarchies, and logic-guided texts from VLMs/LLMs enable cross-modal alignment, improving the detection of ambiguous events.
  • Figure 2: A conceptual diagram of our PiercingEye. After initial feature extraction by two encoders, visual features are learned using a GCN in Euclidean space, and hierarchical relationships are modeled through HE-GCN in hyperbolic space. These representations then interact through the DSI module to improve feature discriminability. Meanwhile, the ambiguous event descriptions generated by the VLM and LLM are used with a novel hyperbolic vision-language guided loss to guide the model in learning more discriminative features. Finally, a classifier produces the violence prediction score.
  • Figure 3: An illustration of the proposed HE-GCN. We first map the features into hyperbolic space via the exponential map and compute the Lorentz similarity between nodes. Then, we calculate the hyperbolic Dirichlet energy and the layer-sensitive hyperbolic association degrees, which are used to construct the message graph, followed by message aggregation.
  • Figure 4: A conceptual diagram of our AETG. We first use a "scene analysis followed by behavior analysis" approach to prompt the VLM to generate textual descriptions for each frame. Then, based on our designed scene-action reasoning, we guide the LLM to systematically generate ambiguous text descriptions from the previously generated ones. The visualization results of AETG can be found in Section \ref{fig:AETG_text_vis}.
  • Figure 5: Examples from a collected subset of UCF-Crime.
  • ...and 7 more figures
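
The first step described in the Figure 3 caption, lifting features into hyperbolic space and comparing nodes there, can be illustrated with a minimal sketch. It assumes the Lorentz (hyperboloid) model with curvature $-1$ and the exponential map taken at the origin; the paper's exact similarity function, curvature, and base point are not given in this summary, so the Lorentzian inner product and geodesic distance below are stand-ins, and all function names are hypothetical.

```python
import math

def exp_map_origin(u):
    """Map a Euclidean feature u (a tangent vector at the origin) onto the
    unit hyperboloid: (cosh(|u|), sinh(|u|) * u / |u|)."""
    norm = math.sqrt(sum(c * c for c in u))
    if norm < 1e-12:
        return [1.0] + [0.0] * len(u)   # the origin of the hyperboloid
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * c for c in u]

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance for curvature -1; the clamp guards against
    rounding pushing the argument of acosh below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = exp_map_origin([0.0, 0.0])
a = exp_map_origin([0.5, 0.0])
b = exp_map_origin([0.0, 2.0])

print(lorentz_inner(a, a))          # close to -1: a lies on the hyperboloid
print(lorentz_distance(origin, a))  # close to 0.5, the norm of the input feature
print(lorentz_distance(a, b))       # larger for features that are far apart
```

In this sketch the distance from the origin equals the Euclidean norm of the input feature, which is one reason the exponential map at the origin is a common way to move Euclidean features into hyperbolic space.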

Theorems & Definitions (1)

  • Theorem 1