
FCA-RAC: First Cycle Annotated Repetitive Action Counting

Jiada Lu, WeiWei Zhou, Xiang Qian, Dongze Lian, Yanyu Xu, Weifeng Wang, Lina Cao, Shenghua Gao

TL;DR

Repetitive action counting suffers from limited action diversity in existing datasets, hindering generalization to unseen actions. The authors propose FCA-RAC, a four-part framework comprising First Cycle Annotated labeling, Dynamic Input Sampling, Multi-Temporal Granularity Convolution, and Training Knowledge Augmentation to exploit the relationship between the first action cycle and subsequent actions. Empirical results on RepCount-A, Countix-AV, UCFRep, and QUVA show superior MAE and OBO scores and strong generalization to unseen actions, aided by a nearest-neighbor embedding mechanism in TKA that reduces reliance on test-time adaptation. The approach delivers robust action counting across seen and unseen actions, with practical implications for real-world RAC tasks including fitness analytics and video understanding.
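
The summary above reports MAE and OBO without defining them. In the repetitive action counting literature these are commonly computed as a count-normalized absolute error and an off-by-one accuracy. The sketch below assumes those common definitions, which this page does not spell out; the function names and sample numbers are illustrative only.

```python
def mae(preds, targets):
    """Normalized mean absolute error: mean of |pred - gt| / gt.

    Assumes every ground-truth count is positive.
    """
    return sum(abs(p - t) / t for p, t in zip(preds, targets)) / len(targets)


def obo(preds, targets):
    """Off-by-one accuracy: fraction of videos with |pred - gt| <= 1."""
    return sum(abs(p - t) <= 1 for p, t in zip(preds, targets)) / len(targets)


# Hypothetical predicted and ground-truth counts for three videos.
preds = [4, 10, 7]
targets = [5, 10, 10]
print("MAE:", mae(preds, targets))
print("OBO:", obo(preds, targets))
```

Lower MAE and higher OBO are better; under these definitions a prediction that misses by one repetition still counts as correct for OBO but contributes to MAE.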

Abstract

Repetitive action counting quantifies the frequency of specific actions performed by individuals. However, existing action-counting datasets have limited action diversity, potentially hampering model performance on unseen actions. To address this issue, we propose a framework called First Cycle Annotated Repetitive Action Counting (FCA-RAC). This framework contains four parts: 1) a labeling technique that annotates each training video with the start and end of the first action cycle, along with the total action count, which enables the model to capture the correlation between the initial action cycle and subsequent actions; 2) an adaptive sampling strategy that maximizes action information retention by adjusting to the speed of the first annotated action cycle in videos; 3) a Multi-Temporal Granularity Convolution (MTGC) module that uses the multi-scale first action cycle as a kernel to convolve across the entire video, which enables the model to capture action variations at different time scales within the video; 4) a strategy called Training Knowledge Augmentation (TKA) that exploits the annotated first action cycle information from the entire dataset, which allows the network to harness shared characteristics across actions and thereby improves model performance and generalizability to unseen actions. Experimental results demonstrate that our approach achieves superior outcomes on RepCount-A and related datasets, highlighting the efficacy of our framework in improving model performance on seen and unseen actions. Our paper makes significant contributions to the field of action counting by addressing the limitations of existing datasets and proposing novel techniques for improving model generalizability.
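
To make the MTGC idea in part 3 concrete, here is a minimal sketch of using a rescaled first cycle as a sliding kernel over per-frame features. It is not the authors' implementation: the helper names, the nearest-index resampling, the zero padding, and the reading of the scales 3, 4, 5 (taken from the Figure 2 caption) as temporal kernel lengths are all assumptions, and the real module operates on learned encoder embeddings inside a trained network.

```python
def resample(cycle, k):
    """Pick k evenly spaced frames from the cycle (nearest-index resampling)."""
    n = len(cycle)
    if k == 1:
        return [cycle[0]]
    return [cycle[round(i * (n - 1) / (k - 1))] for i in range(k)]


def correlate(video, kernel):
    """Slide the kernel over the video with zero padding.

    This is cross-correlation, the operation deep learning libraries
    usually call convolution. Output has one score per video frame.
    """
    k = len(kernel)
    pad = k // 2
    out = []
    for t in range(len(video)):
        score = 0.0
        for j in range(k):
            idx = t + j - pad
            if 0 <= idx < len(video):
                score += sum(a * b for a, b in zip(video[idx], kernel[j]))
        out.append(score)
    return out


def multi_scale_response(video, start, end, scales=(3, 4, 5)):
    """Correlate the video with the first cycle rescaled to each length.

    video: list of per-frame feature vectors; [start, end) marks the
    annotated first cycle. Returns one response channel per scale for
    every frame, i.e. the per-scale outputs concatenated channel-wise.
    """
    cycle = video[start:end]
    responses = [correlate(video, resample(cycle, k)) for k in scales]
    return [list(frame) for frame in zip(*responses)]


# Toy 1-D features: a pattern with period 4 repeated three times.
video = [[0.0], [1.0], [0.0], [-1.0]] * 3
features = multi_scale_response(video, start=0, end=4)
print(len(features), len(features[0]))  # frames, channels
```

Because the toy signal is periodic, the response at each scale repeats with the same period away from the padded borders, which is the kind of regularity a downstream density-map head could exploit.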

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of the experimental settings. (a) The fully annotated setting (TransRAC) versus our FCA-RAC. The red boxes indicate annotated action cycles within the video. Both methods are trained with the total action count of each video annotated. In the fully annotated setting, the start and end frames of every action cycle are additionally annotated as ground truth, whereas in FCA-RAC only the start and end frames of the first action cycle are annotated. At test time, no label information is available in the fully annotated setting, while the first action cycle is annotated in our FCA-RAC method. (b) Comparison of model enhancement strategies. In Test-Time Adaptation (LearningToCountEverything), the FC-V and V-V models are adapted to each video using the information provided by the first annotated cycle. In Training Knowledge Augmentation, the knowledge of our FCA-RAC model is augmented with the first annotated cycles from the training set.
  • Figure 2: FCA-RAC architecture. The video sequences are sampled as described in the paper's sampling subsection. The encoder then extracts the embedding features. In the pre-training stage, the first action cycle is scaled to 3, 4, and 5 and used as a kernel to convolve with the entire video. The resulting features are concatenated and passed through the remaining network, which outputs the density map. In the fine-tuning and inference stages, we use the Training Knowledge Augmentation strategy, in which the network adds the k nearest instances of the first action cycle from the training dataset as kernels to convolve with the input video.
  • Figure 3: Results of FCA-RAC with and without Training Knowledge Augmentation (TKA) on the RepCount-A dataset. Top1 means using the top-1 nearest action cycle from the training set for TKA.
  • Figure 4: Visualization of Training Knowledge Augmentation for seen and unseen actions. The top row shows the input samples, and the three rows below show the nearest neighbors selected from the training set by the TKA strategy. For seen actions, the nearest neighbors come from the same type of action. For unseen actions, the nearest neighbors come from samples with similar actions.
  • Figure 5: Visualization of the density maps predicted by our model for good and bad cases.
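
The retrieval step described for Training Knowledge Augmentation in the Figure 2 and Figure 3 captions (picking the k nearest first-cycle instances from the training set) can be sketched as a nearest-neighbor lookup over embeddings. The captions do not state the distance measure or how a cycle is embedded, so the cosine similarity, the fixed-length embedding vectors, and all names below are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_nearest_cycles(query, bank, k=1):
    """Return indices of the k bank embeddings most similar to the query.

    query: embedding of the test video's annotated first cycle.
    bank:  embeddings of first cycles from the training set (hypothetical).
    """
    ranked = sorted(range(len(bank)),
                    key=lambda i: cosine(query, bank[i]),
                    reverse=True)
    return ranked[:k]


# Hypothetical 2-D embeddings of three training first cycles.
bank = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
query = [1.0, 0.0]
print(k_nearest_cycles(query, bank, k=2))  # [0, 2]
```

With k=1 this corresponds to the Top1 setting named in the Figure 3 caption; the retrieved cycles would then serve as additional kernels alongside the input video's own first cycle.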