Multi-Focused Video Group Activities Hashing

Zhongmiao Qi, Yan Jiang, Bolin Zhang, Chong Wang, Lijun Guo, Pengjiang Qian, Jiangbo Qian

TL;DR

This work introduces STVH, a spatiotemporal interleaved video hashing framework that jointly models object dynamics and group interactions to produce compact $K$-bit hash codes for efficient group activity retrieval. Extending to M-STVH, it adds multi-focused hierarchical fusion and a binary filtering matrix to support activity-focused or visual-focused hashing while reducing storage, using PVF and SGAT to fuse visual and positional cues and a composite loss including $L_{cls}$, $L_q$, $L_{CON}$, and $L_{recon}$. Experiments on VD, CAD, and CAED demonstrate competitive classification and retrieval performance, with MSF enabling a transition from visual to activity semantics across layers and providing flexible retrieval modes. The method offers practical impact for sports analytics and surveillance by enabling scalable, activity-aware video search with controllable focus and storage efficiency, and opens avenues for cross-camera extension.

Abstract

With the explosive growth of video data in complex scenarios, quickly retrieving group activities has become an urgent problem. However, many existing methods can only retrieve videos at the level of the entire video, not at activity granularity. To solve this problem, we propose, for the first time, a spatiotemporal interleaved video hashing (STVH) technique. Through a unified framework, STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution of both group visual features and positional features. Moreover, real-life video retrieval sometimes requires activity features and at other times requires the visual features of objects. We therefore further propose M-STVH (multi-focused spatiotemporal video hashing), an enhanced version that handles this more difficult task. M-STVH incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantic features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH achieve excellent results.
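
To make the retrieval setting concrete, the following is a minimal sketch of how binary hash codes are typically used for search: real-valued features are thresholded into bits, and database items are ranked by Hamming distance to the query code. It is not the STVH or M-STVH pipeline itself; the function names (`binarize`, `hamming_distance`, `rank_by_hamming`), the zero threshold, and the toy 8-bit codes are assumptions made for illustration only.

```python
def binarize(features):
    """Threshold real-valued features at zero to obtain a list of 0/1 bits.

    The zero threshold is an illustrative assumption, not the paper's rule.
    """
    return [1 if x > 0 else 0 for x in features]


def hamming_distance(code_a, code_b):
    """Count the bit positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query, database):
    """Return database indices ordered from nearest to farthest code."""
    return sorted(
        range(len(database)),
        key=lambda i: hamming_distance(query, database[i]),
    )


# Toy database of three 8-bit codes (hypothetical values).
database = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
query = [1, 1, 1, 0, 0, 0, 0, 0]
ranking = rank_by_hamming(query, database)
```

Because comparison reduces to counting differing bits, ranking cost depends on the code length rather than on the dimensionality of the original video features, which is the usual motivation for hashing-based retrieval.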

Paper Structure

This paper contains 30 sections, 18 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Our methods and existing methods.
  • Figure 2: STVH consists of four main modules: 1) Visual Module: extracting features from the input video; 2) Positional Module: modeling spatiotemporal relationships on the input position information; 3) Spatiotemporal Interleaving Module: interleaving visual features and positional features; and 4) Hashing and Classification Learning Module: fusing the visual features with the position features and outputting the corresponding hash codes and classifications.
  • Figure 3: Modeling actions based on the computed IoU.
  • Figure 4: Sparse graph relation attention module (SGAT), which fuses visual and positional features through attention.
  • Figure 5: M-STVH is composed of four modules: 1) Visual Module: extracting features from the input video; 2) Positional Module: modeling spatiotemporal relationships on the input position information; 3) Multi-Focused Spatiotemporal Interleaving Module: interleaving visual features and positional features at multiple layers; and 4) Hashing and Classification Learning Module: fusing the visual features with the position features and outputting the corresponding hash codes and classifications.
  • ...and 7 more figures
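
The Figure 3 caption refers to IoU (intersection over union). How the paper turns IoU into an action model is not described in this overview, so the sketch below only shows the standard IoU computation between two axis-aligned bounding boxes. The `(x1, y1, x2, y2)` box format and the function name `iou` are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2 (assumed format).
    """
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs are treated as having no overlap.
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Applied to the same object's box in consecutive frames, a high IoU suggests little movement and a low IoU suggests a large displacement; this reading is one plausible use and is not confirmed by the text here.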