SCAPE: A Simple and Strong Category-Agnostic Pose Estimator
Yujia Liang, Zixuan Ye, Wenze Liu, Hao Lu
TL;DR
This work addresses category-agnostic pose estimation (CAPE), which aims to localize keypoints on unseen object categories given only a few exemplars. It proposes SCAPE, a simple transformer-based pipeline built from pure self-attention layers and an MLP regression head. Two modules, the Global Keypoint Feature Perceptor (GKP) and the Keypoint Attention Refiner (KAR), are added to improve attention quality and keypoint correlations. Empirical results on MP-100 show SCAPE surpassing prior CAPE methods in both the 1-shot and 5-shot settings, with gains in accuracy and efficiency across backbones, and ablations confirm the value of implicit matching, GKP, and KAR. Together, these results make SCAPE a practical, scalable CAPE baseline and offer insight into using attention mechanisms for cross-category pose estimation. Code and models are released for reproducibility.
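To make the "pure self-attention plus MLP regression head" idea concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes identity query/key/value projections, a single head, a plain residual connection, random untrained weights, and omits GKP and KAR entirely; the names `self_attention`, `mlp_head`, and `predict_keypoints` are hypothetical. Support keypoint tokens and query image tokens are concatenated, passed through a few self-attention layers so that matching happens implicitly inside attention, and each refined keypoint token is then regressed to a normalized (x, y) coordinate.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (an assumption)."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    out = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return out, weights


def mlp_head(token, w1, w2):
    """Two-layer MLP: ReLU hidden layer, sigmoid output giving (x, y) in [0, 1]."""
    hidden = [max(0.0, sum(t * w for t, w in zip(token, col))) for col in w1]
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in w2]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def predict_keypoints(support_kpt_tokens, query_tokens, w1, w2, num_layers=2):
    """Run joint self-attention over all tokens, then regress each keypoint token."""
    k = len(support_kpt_tokens)
    tokens = support_kpt_tokens + query_tokens
    for _ in range(num_layers):
        attended, _ = self_attention(tokens)
        # Residual connection; normalization layers are left out for brevity.
        tokens = [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, attended)]
    return [mlp_head(tok, w1, w2) for tok in tokens[:k]]


# Toy inputs: 3 support keypoint tokens and 10 query tokens of dimension 8.
random.seed(0)
d, hidden = 8, 16
support = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]
query = [[random.gauss(0, 1) for _ in range(d)] for _ in range(10)]
w1 = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(2)]
coords = predict_keypoints(support, query, w1, w2)
```

Because the weights are random, `coords` carries no meaning here; the point is only the data flow, in which the keypoint tokens read from the query tokens through attention rather than through a separate similarity module.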
Abstract
Category-Agnostic Pose Estimation (CAPE) aims to localize keypoints on an object of any category given a few exemplars in an in-context manner. Prior art involves sophisticated designs, e.g., sundry modules for similarity calculation and a two-stage framework, or requires extra heatmap generation and supervision. We notice that CAPE is essentially a feature-matching task, which can be solved within the attention process. We therefore first streamline the architecture into a simple baseline consisting of several pure self-attention layers and an MLP regression head. With this simplification, one only needs to consider attention quality to boost the performance of CAPE. Towards an effective attention process for CAPE, we further introduce two key modules: i) a global keypoint feature perceptor to inject global semantic information into support keypoints, and ii) a keypoint attention refiner to enhance inter-node correlation between keypoints. Together they form a Simple and strong Category-Agnostic Pose Estimator (SCAPE). Experimental results show that SCAPE outperforms prior art by 2.2 and 1.3 PCK under the 1-shot and 5-shot settings, respectively, with faster inference and a lighter model, excelling in both accuracy and efficiency. Code and models are available at https://github.com/tiny-smart/SCAPE
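For readers unfamiliar with the metric quoted above, PCK (Percentage of Correct Keypoints) counts a predicted keypoint as correct when it lies within a fraction of the object size from the ground truth. The sketch below assumes a threshold of 0.2 times a scalar bounding-box size, which is a common convention; the abstract does not state the exact threshold or normalization, and the function name `pck` and its parameters are hypothetical.

```python
import math


def pck(pred, gt, bbox_size, threshold=0.2, visible=None):
    """Percentage of visible keypoints within threshold * bbox_size of ground truth.

    pred, gt: lists of (x, y) tuples in pixels.
    bbox_size: scalar object size used for normalization (assumed convention).
    visible: optional list of booleans; invisible keypoints are skipped.
    """
    if visible is None:
        visible = [True] * len(gt)
    correct = 0
    total = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue
        total += 1
        if math.hypot(px - gx, py - gy) <= threshold * bbox_size:
            correct += 1
    return 100.0 * correct / total if total else 0.0


# Toy example: with bbox_size=100 the tolerance is 20 pixels.
pred = [(10, 10), (50, 52), (90, 40)]
gt = [(12, 11), (50, 50), (60, 40)]
score = pck(pred, gt, bbox_size=100)  # two of three keypoints fall within 20 px
```

Under this reading, a gain of 2.2 PCK means that roughly 2.2 more keypoints out of every 100 land inside the tolerance radius.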
