Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

Zhengqi Zhao, Xiaohu Huang, Hao Zhou, Kun Yao, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng

TL;DR

This work tackles repetitive action counting in videos, a challenging task when multiple actions and distractions are present. It proposes SkimFocusNet, a dual-branch architecture with a skim branch that consumes a long contextual view to guide a focus branch equipped with the long-short adaptive guidance (LSAG) block for precise frame-level counting. The authors introduce Multi-RepCount to reflect real-world scenarios with multiple action types and exemplar-guided specified counting, and they demonstrate state-of-the-art performance on RepCount, UCFRep, and Multi-RepCount with extensive ablations. The approach improves robustness, efficiency, and interpretability by separating global guidance from local counting, enabling effective action localization even in complex scenes.

Abstract

The key to action counting is accurately locating each video's repetitive actions. Instead of estimating the probability of each frame belonging to an action directly, we propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner. The model draws inspiration from the empirical observation that humans typically first skim an entire sequence coarsely to grasp the general action pattern, and then focus frame by frame to determine whether each frame matches the target action. Specifically, SkimFocusNet incorporates a skim branch and a focus branch. The skim branch scans the global contextual information throughout the sequence to identify the potential target action as guidance. Subsequently, the focus branch uses this guidance to identify repetitive actions precisely through a long-short adaptive guidance (LSAG) block. Additionally, we have observed that videos in existing datasets often feature only one type of repetitive action, which inadequately represents real-world scenarios. To describe real-life situations more accurately, we establish the Multi-RepCount dataset, which includes videos containing multiple repetitive motions. On Multi-RepCount, our SkimFocusNet can perform specified action counting, that is, counting a particular action type by referencing an exemplar video. This capability demonstrates the robustness of our method. Extensive experiments demonstrate that SkimFocusNet achieves state-of-the-art performance with significant improvements. We also conduct a thorough ablation study to evaluate the network components. The source code will be published upon acceptance.

Paper Structure (20 sections, 4 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Illustration of the everyday observation that humans count actions in two steps: (1) We skim the sequence coarsely and acquire the possible target action pattern as guidance. (2) With the guidance of the skim process, we focus on the frames that include the target action to conduct counting.
  • Figure 2: Framework overview. SkimFocusNet has two branches, i.e., the skim branch and the focus branch. The contextual view $G$, a long sequence intended to cover as much of the entire video as possible, is processed by the skim branch. Next, an informative sampling module samples the instructive frames $C$ for the focus branch. The instructive frames $C$ and the fine-grained views $F$ are passed through the focus branch and encoded as features $X_C$ and $X_F^i$, respectively. Max pooling is applied to extract the critical guidance information $Z$ from the feature $X_C$. The long-short adaptive guidance (LSAG) block integrates $Z$ with the feature $X_F^i$ to help differentiate action-relevant from action-irrelevant features. Mean squared error (MSE) loss supervises the learning of the two branches separately.
  • Figure 3: Implementations of different informative sampling strategies, illustrating the sampling results of the (a) random sampling strategy, (b) uniform sampling strategy, and (c) top $N_C$ sampling strategy.
  • Figure 4: The long-short adaptive guidance (LSAG) block integrates critical guidance information $Z$ with fine-grained feature embedding $X_F^i$.
  • Figure 5: An example of performing specified action counting using SkimFocusNet.
  • ...and 4 more figures
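
The three informative sampling strategies named in Figure 3 and the max-pooling step from Figure 2 can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the per-frame `scores` list (assumed here to be some frame-level relevance score produced by the skim branch), and the choice of segment centres for uniform sampling are all assumptions made for the example.

```python
import random


def random_sampling(num_frames, n_c, seed=0):
    """(a) Draw n_c distinct frame indices at random, returned in temporal order."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    return sorted(rng.sample(range(num_frames), n_c))


def uniform_sampling(num_frames, n_c):
    """(b) Take the centre frame of each of n_c equal-width temporal segments."""
    return [int((k + 0.5) * num_frames / n_c) for k in range(n_c)]


def top_n_sampling(scores, n_c):
    """(c) Keep the n_c highest-scoring frames, returned in temporal order.

    `scores` is a hypothetical list of per-frame relevance scores.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_c])


def max_pool_over_time(features):
    """Element-wise max across frames: per-frame vectors -> one guidance vector."""
    return [max(channel) for channel in zip(*features)]


# Toy usage: 6 frames, hypothetical skim scores, N_C = 3.
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7]
instructive = top_n_sampling(scores, 3)

# Toy 2-dimensional features for three instructive frames (stand-in for X_C).
x_c = [[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]]
z = max_pool_over_time(x_c)  # stand-in for the guidance vector Z
```

In this sketch only the top $N_C$ strategy depends on the skim branch's output; random and uniform sampling ignore it, which is why they serve as natural baselines for comparison.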