SCAPE: A Simple and Strong Category-Agnostic Pose Estimator

Yujia Liang, Zixuan Ye, Wenze Liu, Hao Lu

TL;DR

This work addresses category-agnostic pose estimation (CAPE), which aims to localize keypoints on unseen object categories given only a few exemplars. It proposes SCAPE, a simple transformer-based pipeline built from pure self-attention layers and an MLP regression head, augmented by two modules, the Global Keypoint Feature Perceptor (GKP) and the Keypoint Attention Refiner (KAR), which improve attention quality and keypoint correlations. On MP-100, SCAPE surpasses prior CAPE methods in both 1-shot and 5-shot settings, with gains in accuracy and efficiency across backbones, and ablations confirm the value of implicit matching, GKP, and KAR. The result is a practical, scalable CAPE baseline that also offers insight into using attention for cross-category pose estimation; code and models are released for reproducibility.

Abstract

Category-Agnostic Pose Estimation (CAPE) aims to localize keypoints on an object of any category given few exemplars in an in-context manner. Prior methods involve sophisticated designs, e.g., sundry modules for similarity calculation and a two-stage framework, or require extra heatmap generation and supervision. We notice that CAPE is essentially a feature-matching task, which can be solved within the attention process. We therefore first streamline the architecture into a simple baseline consisting of several pure self-attention layers and an MLP regression head -- this simplification means that one only needs to consider the attention quality to boost the performance of CAPE. Towards an effective attention process for CAPE, we further introduce two key modules: i) a global keypoint feature perceptor to inject global semantic information into support keypoints, and ii) a keypoint attention refiner to enhance inter-node correlation between keypoints. They jointly form a Simple and strong Category-Agnostic Pose Estimator (SCAPE). Experimental results show that SCAPE outperforms prior methods by 2.2 and 1.3 PCK under 1-shot and 5-shot settings, respectively, with faster inference speed and lighter model capacity, excelling in both accuracy and efficiency. Code and models are available at https://github.com/tiny-smart/SCAPE
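
To make the "pure self-attention plus MLP regression head" baseline concrete, here is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation: it uses a single attention head, omits layer normalization, positional embeddings, GKP, and KAR, and all names (`self_attention`, `mlp_head`, `baseline_forward`), dimensions, and the sigmoid output range are hypothetical choices made for illustration. The only idea taken from the abstract is that support keypoint tokens and query image tokens interact through self-attention, and keypoint coordinates are then regressed directly from the keypoint tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head self-attention over the joint token sequence (N, D),
    # with a residual connection. Returns updated tokens and the (N, N) map.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v, attn

def mlp_head(x, w1, b1, w2, b2):
    # Two-layer MLP; the sigmoid maps outputs to normalized (x, y) in (0, 1).
    # The sigmoid is an assumption of this sketch, not taken from the paper.
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))

def baseline_forward(kpt_tokens, query_tokens, layers, head):
    # Concatenate K keypoint tokens with the query tokens, run the
    # self-attention layers, then regress coordinates from the first K tokens.
    num_kpts = kpt_tokens.shape[0]
    tokens = np.concatenate([kpt_tokens, query_tokens], axis=0)
    attn = None
    for w_q, w_k, w_v in layers:
        tokens, attn = self_attention(tokens, w_q, w_k, w_v)
    # The keypoint-to-query block of the last attention map plays the role
    # of the implicit matching between support keypoints and the query image.
    return mlp_head(tokens[:num_kpts], *head), attn[:num_kpts, num_kpts:]

# Toy usage with random (untrained) weights; all sizes are hypothetical.
rng = np.random.default_rng(0)
dim, num_kpts, num_query = 16, 5, 64
kpt = rng.normal(size=(num_kpts, dim))
qry = rng.normal(size=(num_query, dim))
layers = [tuple(0.1 * rng.normal(size=(dim, dim)) for _ in range(3))
          for _ in range(3)]
head = (0.1 * rng.normal(size=(dim, dim)), np.zeros(dim),
        0.1 * rng.normal(size=(dim, 2)), np.zeros(2))
coords, kpt_to_query_attn = baseline_forward(kpt, qry, layers, head)
print(coords.shape)  # (5, 2)
```

Because matching happens inside the attention map rather than in a separate similarity module, the quality of `kpt_to_query_attn` is the main thing left to improve in such a design, which is the motivation the abstract gives for the two added modules.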

Paper Structure

This paper contains 24 sections, 6 equations, 21 figures, and 12 tables.

Figures (21)

  • Figure 1: Comparison with prior methods. (a) POMNet [xu2022pose] relies on similarity matching to obtain similarity maps and infer keypoint coordinates. (b) CapeFormer [shi2023matching] presents a two-stage framework that iteratively refines unreliable initial predictions. (c) Our SCAPE employs self-attention for feature interaction and directly regresses keypoints without explicit matching. For better similarity matching, we introduce two modules. The circle size indicates the number of model parameters (excluding the backbone). The y-axis represents accuracy (PCK), and the x-axis indicates inference speed (FPS). Our model facilitates the seamless integration of state-of-the-art self-supervised learning techniques for scaling Vision Transformers (ViTs).
  • Figure 2: The implicit attention map is closer to the ground truth than the explicit similarity map. The second column shows the similarity map obtained from CapeFormer, and the third column shows the final-layer attention map between support keypoints and the query image in the first stage of CapeFormer.
  • Figure 3: Visualization of the last three attention maps between keypoints and the query image, together with the result. (left) The query target is the right foot; however, the attention easily focuses on the left foot, which has a similar appearance, leading to inaccurate estimation. GKP injects global information into the support keypoints, equipping them with relative positional information to distinguish left and right keypoints. (right) However, GKP struggles under occlusion. By modeling the correlation between keypoints, we enable inference from visible to invisible keypoints. Although the right knee is occluded, the model can infer its correct position by locating the right foot.
  • Figure 4: Technical pipeline of SCAPE. SCAPE consists of four modules: feature extractor, global keypoint feature perceptor, feature interactor, and regression head. The feature extractor is identical to that of the prior method. After the support and query images are processed by the backbone, support keypoint tokens are created by a weighted sum of the labeled keypoint heatmap and the support features. The (local) support keypoints are then combined with a support keypoint identifier (from CapeFormer), while query image tokens are formed from the query features with positional embedding. The support keypoints are fed into the Global Keypoint Feature Perceptor, where they cross-attend to the support image to obtain global support keypoints. Next, an interaction module composed of multi-head refined self-attention (MHRSA) conducts the interaction between the keypoint and query features. The Keypoint Attention Refiner (KAR) is inserted into each self-attention stage to refine the attention maps among keypoints. Finally, a simple MLP head is applied to the support keypoint tokens to regress keypoint coordinates.
  • Figure 5: Visualization of the attention maps of support keypoints. Initially, because of significant disparities between the support and query images, accurate matching is challenging, so the attention process focuses on self-attention among keypoints (highlighted in yellow) to build contextual information. Later layers lean towards implicit matching between the keypoints and the query image.
  • ...and 16 more figures
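
The Figure 4 caption states that support keypoint tokens are created by a weighted sum of the labeled keypoint heatmap and the support features. A minimal NumPy sketch of that pooling step is given below. Everything beyond the weighted sum itself is an assumption made for illustration: the Gaussian heatmap, the normalization of heatmap weights to sum to one, the tensor layouts, and the function names `gaussian_heatmap` and `keypoint_tokens` are hypothetical and not taken from the paper or its code.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    # Hypothetical helper: a Gaussian bump centred on keypoint (cx, cy).
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def keypoint_tokens(features, heatmaps, eps=1e-8):
    # features: (C, H, W) support feature map from the backbone.
    # heatmaps: (K, H, W), one heatmap per labeled support keypoint.
    # Returns (K, C): each token is a heatmap-weighted sum of features.
    channels = features.shape[0]
    num_kpts = heatmaps.shape[0]
    weights = heatmaps.reshape(num_kpts, -1)
    # Normalizing the weights is an assumption of this sketch.
    weights = weights / (weights.sum(axis=1, keepdims=True) + eps)
    return weights @ features.reshape(channels, -1).T

# Toy usage: two keypoints on an 8x8 feature map with 4 channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))
heatmaps = np.stack([gaussian_heatmap(8, 8, 3, 5),
                     gaussian_heatmap(8, 8, 6, 1)])
tokens = keypoint_tokens(features, heatmaps)
print(tokens.shape)  # (2, 4)
```

With a very narrow heatmap the token reduces to the feature vector at the keypoint location, so these tokens carry only local appearance. That locality is consistent with the caption calling them "local" support keypoints before GKP adds global context.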