Repetitive Action Counting with Hybrid Temporal Relation Modeling

Kun Li, Xinge Peng, Dan Guo, Xun Yang, Meng Wang

TL;DR

This work tackles Repetitive Action Counting (RAC) in videos by introducing the Hybrid Temporal Relation Modeling Network (HTRM-Net), which targets the diversity of real-world action cycles and the interruptions that disrupt them. It builds temporal self-similarity matrices (TSSMs) in a bi-modal manner, combining multi-head self-attention with dual-softmax, diversifies them with a Random Matrix Dropping module, and injects local temporal context. The multi-scale matrices are then fused and passed to a transformer-based decoder that regresses a density map. The method achieves state-of-the-art results on RepCount-A and strong cross-dataset performance on UCFRep and QUVA, with improvements over TransRAC of $20.04\%$ in MAE and $22.76\%$ in OBO on RepCount-A. These results indicate that the approach generalizes to unseen action categories and copes with complex temporal dynamics such as non-uniform periods and action interruptions.
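The pieces named above (a regressed density map, MAE, and OBO) can be made concrete with a small sketch. The summary does not define the metrics, so this assumes the conventions common in the RAC literature: the predicted count is the sum of the density map, MAE is the absolute count error normalized by the ground-truth count and averaged over videos, and OBO is treated as an accuracy (the fraction of videos whose predicted count is within one of the ground truth). The function names and the toy numbers are illustrative only.

```python
import numpy as np


def count_from_density(density):
    """Predicted repetition count, taken as the sum of the per-frame density map."""
    return float(np.sum(density))


def mae(pred_counts, gt_counts):
    """Normalized mean absolute error: mean of |pred - gt| / gt (assumes gt > 0)."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.mean(np.abs(pred - gt) / gt))


def obo(pred_counts, gt_counts):
    """Off-by-one accuracy: fraction of videos with |pred - gt| <= 1."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.mean(np.abs(pred - gt) <= 1))


# Toy example (made-up values): one density map plus two hand-picked counts.
toy_density = np.array([0.0, 0.5, 0.5, 0.0, 0.6, 0.4, 0.0, 1.0])
toy_pred = [round(count_from_density(toy_density)), 10, 7]
toy_gt = [4, 10, 4]
toy_mae = mae(toy_pred, toy_gt)
toy_obo = obo(toy_pred, toy_gt)
```

Under these conventions a lower MAE and a higher OBO are better, so a video whose count is off by three lowers OBO in the same way as one that is off by two.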

Abstract

Repetitive Action Counting (RAC) aims to count the number of repetitive actions occurring in videos. In the real world, repetitive actions are highly diverse and pose numerous challenges (e.g., viewpoint changes, non-uniform periods, and action interruptions). Existing RAC methods based on the temporal self-similarity matrix (TSSM) struggle to capture action periods adequately when applied to complex everyday videos. To tackle this issue, we propose a novel method named Hybrid Temporal Relation Modeling Network (HTRM-Net) to build diverse TSSMs for RAC. HTRM-Net consists of three key components: bi-modal temporal self-similarity matrix modeling, random matrix dropping, and local temporal context modeling. Specifically, we construct temporal self-similarity matrices with bi-modal (self-attention and dual-softmax) operations, yielding diverse matrix representations from the combination of row-wise and column-wise correlations. To further enhance the matrix representations, we incorporate a random matrix dropping module that explicitly guides channel-wise learning of the matrix. We then inject the local temporal context of video frames and the learned matrix into temporal correlation modeling, which makes the model robust to error-prone situations such as action interruption. Finally, a multi-scale matrix fusion module adaptively aggregates temporal correlations across multi-scale matrices. Extensive intra-dataset and cross-dataset experiments demonstrate that the proposed method not only outperforms current state-of-the-art methods but also counts repetitive actions accurately in unseen action categories. Notably, our method surpasses the classical TransRAC method by $20.04\%$ in MAE and $22.76\%$ in OBO.
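To illustrate what "bi-modal" matrix construction could look like, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head with no learned projections, and it assumes dual-softmax means the element-wise product of a row-wise and a column-wise softmax over the same score matrix. The feature array `feats` and all function names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def similarity_scores(feats):
    """Scaled dot-product scores between all frame pairs; feats has shape (T, d)."""
    d = feats.shape[1]
    return feats @ feats.T / np.sqrt(d)


def attention_tssm(feats):
    """Self-attention style matrix: row-wise softmax, so each row sums to 1."""
    return softmax(similarity_scores(feats), axis=1)


def dual_softmax_tssm(feats):
    """Dual-softmax matrix: product of row-wise and column-wise softmax (assumed form)."""
    scores = similarity_scores(feats)
    return softmax(scores, axis=1) * softmax(scores, axis=0)


def bimodal_tssm(feats):
    """Stack both matrices along a channel axis, giving shape (2, T, T)."""
    return np.stack([attention_tssm(feats), dual_softmax_tssm(feats)], axis=0)


# Toy features: 16 frames whose embedding repeats every 4 frames.
phase = 2.0 * np.pi * np.arange(16) / 4.0
toy_feats = np.stack([np.cos(phase), np.sin(phase)], axis=1)
toy_tssm = bimodal_tssm(toy_feats)
```

The row-wise softmax only normalizes each frame against all others, whereas the dual-softmax entry for a frame pair is large only when the two frames match each other strongly in both directions. That is one way to read the abstract's "combination of row-wise and column-wise correlations".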

Paper Structure

This paper contains 35 sections, 13 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: (a) Illustration of Repetitive Action Counting (RAC), which aims to count the number of repetitive actions in a video. (b) The challenges in RAC include high-frequency periods, action interruptions, non-uniform periods, and viewpoint changes [hu2022transrac]. (c) Performance comparison with the classical method TransRAC [hu2022transrac].
  • Figure 2: Illustration of the Temporal Self-Similarity Matrix (TSSM) in (a) TransRAC [hu2022transrac] and (b) our method. RMD denotes the proposed Random Matrix Dropping in Sec. \ref{sec:embedding}. The TSSM in our method exhibits rich temporal self-similarities corresponding to the action periods. $\blacktriangle$ denotes the start of each repetitive action.
  • Figure 3: Overview of the proposed Hybrid Temporal Relation Modeling Network (HTRM-Net). First, we extract multi-scale video features $\mathbf{V}_i, i\in\{1,2,3\}$, which are used both to generate the temporal self-similarity matrices in a bi-modal manner and to build the local temporal context. In the bi-modal temporal self-similarity modeling, multi-head self-attention and dual-softmax operations build fine-grained temporal relations in a spatial-wise manner. Random matrix dropping (RMD) is then applied to the bi-modal matrices to build diverse matrix representations in a channel-wise manner. Meanwhile, we inject the local temporal context into the temporal self-similarity matrix to guard against error-prone situations. Additionally, a multi-scale self-similarity fusion module fuses the temporal self-similarity matrices at each scale. Finally, we fuse the multi-scale matrices and use the decoder to predict the target density map $\hat{D}$. $T$ denotes the number of frames and $d$ the number of channels.
  • Figure 4: Ablation study of the drop ratio $p$ in the Random Matrix Dropping module and the parameter $\Delta K$ of the Local Temporal Context Modeling module on the RepCount-A dataset. $p$ is varied with $\Delta K = 2$ fixed, and $\Delta K$ is varied with $p = 0.3$ fixed.
  • Figure 5: Visualization of predicted density maps on the RepCount-A dataset. (a), (b), and (c) are success cases, while (d) is a challenging case. In case (c), the later frames of the video contain action interruptions and viewpoint changes. "GT" denotes the ground-truth density map. "TransRAC" and "Ours" denote the predictions of TransRAC [hu2022transrac] and our proposed method, respectively.
  • ...and 2 more figures
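
The captions above mention a drop ratio $p$ for Random Matrix Dropping but do not describe how it works. As a rough illustration only, the sketch below assumes the module behaves like channel-wise dropout over a stack of similarity matrices: each whole $T \times T$ matrix is zeroed with probability $p$ during training, and the surviving matrices are rescaled by $1/(1-p)$. The rescaling, the "keep at least one matrix" safeguard, and the name `random_matrix_drop` are assumptions rather than details taken from the paper.

```python
import numpy as np


def random_matrix_drop(mats, p, rng, training=True):
    """Zero whole matrices in a (C, T, T) stack with probability p (assumed behavior).

    Surviving matrices are rescaled by 1 / (1 - p), following the usual
    inverted-dropout convention. At inference time the input is returned as-is.
    """
    if not training or p <= 0.0:
        return mats
    num_channels = mats.shape[0]
    keep = rng.random(num_channels) >= p
    if not keep.any():
        # Assumed safeguard: never drop every matrix at once.
        keep[rng.integers(num_channels)] = True
    return mats * keep[:, None, None] / (1.0 - p)


# Toy usage with the drop ratio held fixed in the Figure 4 ablation (p = 0.3).
toy_rng = np.random.default_rng(0)
toy_mats = np.ones((8, 4, 4))
toy_dropped = random_matrix_drop(toy_mats, p=0.3, rng=toy_rng)
toy_eval = random_matrix_drop(toy_mats, p=0.3, rng=toy_rng, training=False)
```

Dropping entire matrices rather than individual entries forces the downstream layers not to rely on any single channel, which is one plausible reading of "channel-wise" in the Figure 3 caption.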