STEP: Simultaneous Tracking and Estimation of Pose for Animals and Humans

Shashikant Verma, Harish Katti, Soumyaratna Debnath, Yamuna Swamy, Shanmuganathan Raman

TL;DR

STEP introduces a Transformer-based discriminative model predictor for simultaneous tracking and pose estimation across diverse species, removing the need for per-frame keypoint inputs via Gaussian Map Soft Prediction (GMSP) and Offset Map Regression Adapter (OMRA). The approach encodes target state information with keypoint offsets, Gaussian maps, and learnable embeddings, and predicts target-model weights for localization, tracking, and keypoint regression, enabling end-to-end online inference with memory updates. Evaluations across datasets including APT36K, APT10K, CrowdPose, TriMouse, Marmoset, and Fish show STEP achieving competitive or superior pose-estimation metrics while maintaining high tracking performance, with strong results under occlusion and across natural and synthetic sequences. A case study on Awaji Monkey Center live streams demonstrates practical utility for behavioral analysis, and inference speeds around 63 FPS on high-end GPUs highlight the method’s potential for real-time applications.

Abstract

We introduce STEP, a novel framework utilizing Transformer-based discriminative model prediction for simultaneous tracking and estimation of pose across diverse animal species and humans. We are inspired by the fact that the human brain exploits spatiotemporal continuity and performs concurrent localization and pose estimation despite the specialization of brain areas for form and motion processing. Traditional discriminative models typically require predefined target states for determining model weights, a challenge we address through Gaussian Map Soft Prediction (GMSP) and Offset Map Regression Adapter (OMRA) Modules. These modules remove the necessity of keypoint target states as input, streamlining the process. Our method starts with a known target state in the initial frame of a given video sequence. It then seamlessly tracks the target and estimates keypoints of anatomical importance as output for subsequent frames. Unlike prevalent top-down pose estimation methods, our approach doesn't rely on per-frame target detections due to its tracking capability. This facilitates a significant advancement in inference efficiency and potential applications. We train and validate our approach on datasets encompassing diverse species. Our experiments demonstrate superior results compared to existing methods, opening doors to various applications, including but not limited to action recognition and behavioral analysis.
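The target-state encoding mentioned above combines Gaussian maps with keypoint offsets. As a rough illustration only, the sketch below builds both for a single keypoint on a small grid. The function names, the `sigma` value, and the offset convention (displacement from each cell to the keypoint) are assumptions made for this example and are not taken from the paper.

```python
import math


def gaussian_map(height, width, cx, cy, sigma=2.0):
    """Return a height x width grid peaking at 1.0 on the keypoint (cx, cy).

    sigma is a hypothetical spread parameter chosen for illustration.
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def offset_map(height, width, cx, cy):
    """Return, for every grid cell, the (dx, dy) displacement to the keypoint.

    This convention is an assumption; other encodings are equally plausible.
    """
    return [[(cx - x, cy - y) for x in range(width)] for y in range(height)]


# Example: one keypoint at column 3, row 5 of an 8 x 8 grid.
heat = gaussian_map(8, 8, cx=3, cy=5)
offsets = offset_map(8, 8, cx=3, cy=5)
```

The Gaussian map gives a soft indication of where the keypoint lies, while the offset map lets any cell near the peak point back to the exact location.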

Paper Structure

This paper contains 28 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The proposed STEP framework for simultaneous tracking and pose estimation. (a) the full architecture, (b) the calculation of Target State Encodings, (c) the architecture of the offset map regression adapter (OMRA) module, and (e) the standard transformer architecture comprising multi-headed self-attention blocks.
  • Figure 2: Modules of the STEP architecture: (a) GMSP Module, (b) Localizer Module $\Lambda_L$, (c) Keypoint Localizer Module $\Lambda_K$, and (d) Bounding Box Regressor Module $\Lambda_B$.
  • Figure 3: For a video sequence, columns display target score maps $\hat{K}_\text{gm}$ when initiated with bounding boxes, as depicted in (a) and (b). In (c), our STEP framework runs concurrently for each bounding box to track and estimate the pose of respective targets. Notably, observe the robustness of STEP in estimating keypoints, particularly in scenarios involving occlusion. A failure case is showcased in the last frame of (c) for the red bounded target, stemming from ambiguous activations within the red target's score map, as evident in the last frame of (a).
  • Figure 4: Comparison of Mean Squared Error (MSE) and Object Keypoint Similarity (OKS) for individual keypoints against existing methods: ViTPose [xu2022vitpose], YOLOv8 [jocher2023yolo], TransPose [yang2021transpose], ViPNAS [xu2021vipnas], DeepLabCut [lauer2022multi], HRNet [sun2019deep], and RTMPose [rtmpose]. The marmoset dataset consists of top-view images of marmosets in a cage, in which lower-limb keypoints appear infrequently. This imbalance in the training data leads to notable inaccuracies across all methods for the four bottom-limb keypoints. The bottom row displays the output of our approach on a frame from the corresponding dataset's video sequence, where $*$-marked frames are from synthetic sequences.
  • Figure 6: We demonstrate tracking and pose estimation during the occlusion of a target human instance by another human with a similar appearance. Zoom-in is recommended for clarity. Panel (a) illustrates memory updates occurring only when at least 50% of keypoints and the bounding box localization are predicted with a probability greater than $\tau_m = 0.6$. In contrast, panel (b) shows memory updates being performed regardless of the probability value. In both panels, frames with low confidence, which may introduce or propagate errors, are indicated by a red bounding box. In panel (b), STEP wrongly adjusts its confidence from the original target to the new instance based on the memory state. Subsequently, it starts tracking and performing pose estimation on the new object.
  • ...and 2 more figures
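
The memory-update rule described in the Figure 6 caption can be sketched as a simple confidence gate. This is a minimal illustration of one reading of the caption: memory is updated only if the bounding-box probability exceeds $\tau_m = 0.6$ and at least 50% of keypoint probabilities do as well. The function name, argument names, and the handling of an empty keypoint list are hypothetical and not taken from the paper.

```python
TAU_M = 0.6                  # threshold tau_m stated in the Figure 6 caption
MIN_KEYPOINT_FRACTION = 0.5  # "at least 50% of keypoints"


def should_update_memory(keypoint_probs, box_prob,
                         tau=TAU_M, min_fraction=MIN_KEYPOINT_FRACTION):
    """Decide whether the current frame is confident enough to enter memory.

    keypoint_probs: per-keypoint confidence values in [0, 1] (assumed input).
    box_prob: confidence of the bounding-box localization (assumed input).
    """
    if not keypoint_probs:
        # Assumption: with no keypoint evidence, skip the update.
        return False
    confident = sum(1 for p in keypoint_probs if p > tau)
    enough_keypoints = confident / len(keypoint_probs) >= min_fraction
    return box_prob > tau and enough_keypoints


# Example: two of four keypoints are confident and the box is confident.
update = should_update_memory([0.9, 0.8, 0.3, 0.2], box_prob=0.7)
```

Skipping the update on low-confidence frames corresponds to panel (a) of Figure 6; always updating, as in panel (b), amounts to returning `True` unconditionally.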