
GMFL-Net: A Global Multi-geometric Feature Learning Network for Repetitive Action Counting

Jun Li, Jinying Wu, Qiming Li, Feifei Guo

TL;DR

GMFL-Net improves information representation by fusing multi-geometric features and learning the semantic similarity among them, and it enhances the inter-dependencies between point-wise and channel-wise elements to synthesise a comprehensive and representative global feature representation.

Abstract

With the continuous development of deep learning, repetitive action counting is attracting growing attention from researchers. Extracting pose keypoints with human pose estimation networks has proven to be an effective pose-level approach. However, existing pose-level methods have two shortcomings: a single coordinate representation is not stable enough to handle action distortions caused by changes in camera viewpoint, so salient poses are not identified accurately, and these methods are vulnerable to misdetection during the transition from an exception to the actual action. To overcome these problems, we propose a simple but efficient Global Multi-geometric Feature Learning Network (GMFL-Net). Specifically, we design a MIA-Module that improves information representation by fusing multi-geometric features and learning the semantic similarity among them. Then, to improve the feature representation from a global perspective, we design a GBFL-Module that enhances the inter-dependencies between point-wise and channel-wise elements and combines them with the rich local information generated by the MIA-Module to synthesise a comprehensive and representative global feature representation. In addition, because existing datasets are insufficient, we collect a new dataset called Countix-Fitness-pose (https://github.com/Wantong66/Countix-Fitness), which contains different cycle lengths and exceptions as well as a test set with longer durations, and we annotate it with fine-grained pose-level annotations. We also add two new action classes, namely lunge and rope push-down. Finally, extensive experiments on the challenging RepCount-pose, UCFRep-pose, and Countix-Fitness-pose benchmarks show that our proposed GMFL-Net achieves state-of-the-art performance.
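
As a rough illustration of what "multi-geometric features" can mean here, the sketch below builds the three feature types named in the Figure 1 caption (joint coordinates, pairwise joint distances, and three-joint angles) for a single frame, keeping $N$ randomly chosen distances and angles. It is a minimal sketch, not the authors' implementation: the function names, the fixed random seed, and the convention of measuring each angle at the middle joint of a triple are all assumptions made for the example.

```python
import itertools
import numpy as np


def joint_angles(a, b, c, eps=1e-8):
    """Angle in radians at vertex b between rays b->a and b->c (row-wise).

    Measuring the angle at the middle joint is an assumption of this sketch.
    """
    u, v = a - b, c - b
    cos = np.sum(u * v, axis=-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + eps
    )
    return np.arccos(np.clip(cos, -1.0, 1.0))


def sample_feature_indices(n_joints, seed=0):
    """Randomly pick N joint pairs and N joint triples (hypothetical helper).

    The same indices should be reused for every frame so that each feature
    position always refers to the same joints.
    """
    rng = np.random.default_rng(seed)
    pairs = np.array(list(itertools.combinations(range(n_joints), 2)))
    triples = np.array(list(itertools.combinations(range(n_joints), 3)))
    pairs = pairs[rng.choice(len(pairs), size=n_joints, replace=False)]
    triples = triples[rng.choice(len(triples), size=n_joints, replace=False)]
    return pairs, triples


def frame_features(joints, pairs, triples):
    """Return (coordinates P, distances D, angles A) for one frame.

    joints: array of shape (N, 3) holding (x, y, z) per joint.
    """
    dists = np.linalg.norm(joints[pairs[:, 0]] - joints[pairs[:, 1]], axis=-1)
    angles = joint_angles(
        joints[triples[:, 0]], joints[triples[:, 1]], joints[triples[:, 2]]
    )
    return joints, dists, angles
```

Distances and angles between joints do not change when the whole skeleton is rotated or translated, which is one plausible reason such features are less sensitive to camera viewpoint than raw coordinates.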

Paper Structure

This paper contains 30 sections, 17 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Coordinates $P_i$ ($i=1,2,\dots,f$) contain the coordinates of the $N$ joints in each of the $f$ frames. Distances $D_i$ ($i=1,2,\dots,f$) contain the distance between every two joints in each frame. Angles $A_i$ ($i=1,2,\dots,f$) contain the angle between every three joints in each frame. To keep the number of features consistent with the joint coordinates, we randomly select $N$ distance and angle features. A darker red color in the graph means it is more likely to represent salient pose I, while a darker blue color means it is more likely to represent salient pose II.
  • Figure 2: The overall architecture of GMFL-Net includes the MIA-Module, GBFL-Module, Classification Head, and RC-Module. The $(x, y, z)$ in the figure represent joint coordinates, $\alpha_1, \alpha_2, \alpha_3$ represent angles between joints, and $d_1, d_2, d_3$ represent distances between joints. $P$ represents the coordinate information and $G$ represents the rest of the geometric information, i.e., angles and distances between joints.
  • Figure 3: Illustration of Triplet Margin Loss. We use it to improve the Encoder. After training, the distance between the anchor and the positive example decreases, while the distance between the anchor and the negative example increases.
  • Figure 4: Illustration of the mechanism of the RC-Module. We scan all frames and obtain scores $S_c$ for a specific action class. In this process, we set entry and exit thresholds, which are used to distinguish between the two salient poses. When the score of Salient Pose I exceeds the entry threshold and the score of Salient Pose II is below the exit threshold, the RC-Module mechanism triggers. The count is incremented by one whenever Salient Poses I and II are triggered sequentially.
  • Figure 5: Illustration of the six action classes in our proposed dataset and the implementation of the PSR mechanism [poserac]. We need to accurately select two salient poses that represent the completion of an action in the given videos, labelled as salient pose I and salient pose II. For example, in the given video, we select frame 87 as the representative of salient pose I and frame 145 as the representative of salient pose II.
  • ...and 1 more figures
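
The thresholded counting described in the Figure 4 caption can be sketched as a small two-state machine. This is one possible reading of the caption, not the paper's code: it assumes a single per-frame score sequence in $[0, 1]$ for the action class, where a high score indicates salient pose I and a low score indicates salient pose II. The function name and the threshold values 0.7 and 0.3 are hypothetical.

```python
def count_repetitions(scores, enter_threshold=0.7, exit_threshold=0.3):
    """Count repetitions from per-frame class scores (illustrative sketch).

    Assumption: a score above `enter_threshold` means salient pose I was
    reached, and a later score below `exit_threshold` means salient pose II
    was reached. One I -> II sequence counts as one repetition.
    """
    count = 0
    in_pose_one = False  # True once salient pose I has been triggered
    for s in scores:
        if not in_pose_one and s > enter_threshold:
            in_pose_one = True
        elif in_pose_one and s < exit_threshold:
            in_pose_one = False
            count += 1
    return count
```

Using two separate thresholds rather than one gives hysteresis: scores that hover between the two values, for instance during a transition or an exception, neither start nor complete a repetition.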