Meta-Point Learning and Refining for Category-Agnostic Pose Estimation

Junjie Chen, Jiebin Yan, Yuming Fang, Li Niu

TL;DR

The paper tackles category-agnostic pose estimation (CAPE) by moving away from reliance on a few support-keypoint features and instead learning class-agnostic meta-points that capture inherent keypoint information. It introduces a two-stage framework. The first stage predicts meta-points without any support, using learnable meta-embeddings and a progressive deformable point decoder. The second stage uses support data to assign the meta-points to the desired keypoints via bipartite matching and then refines them with a second point decoder. Key contributions include the first explicit learning of class-agnostic meta-points, the progressive deformable point decoder, a slacked L1 regression loss to stabilize training, and strong empirical gains on the MP-100 CAPE benchmark, including cross-category generalization. The approach promises improved robustness to occlusion and support sparsity, enabling more reliable keypoint estimation across diverse object classes. Overall, MetaPoint advances CAPE by uncovering inherent keypoint structure and providing a practical refinement pipeline that outperforms prior methods on the large-scale MP-100 dataset.

Abstract

Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary classes given a few support images annotated with keypoints. Existing methods rely only on the features extracted at support keypoints to predict or refine the keypoints on the query image, but a few support feature vectors are local and inadequate for CAPE. Considering that humans can quickly perceive potential keypoints of arbitrary objects, we propose a novel framework for CAPE based on such potential keypoints (named meta-points). Specifically, we maintain learnable embeddings to capture inherent information of various keypoints, which interact with image feature maps to produce meta-points without any support. The produced meta-points can serve as meaningful potential keypoints for CAPE. Due to the inevitable gap between inherency and annotation, we finally utilize the identities and details offered by support keypoints to assign and refine meta-points to the desired keypoints in the query image. In addition, we propose a progressive deformable point decoder and a slacked regression loss for better prediction and supervision. Our novel framework not only reveals the inherency of keypoints but also outperforms existing CAPE methods. Comprehensive experiments and in-depth studies on the large-scale MP-100 dataset demonstrate the effectiveness of our framework.
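
The assignment step mentioned in the abstract (giving meta-points the identities of support keypoints) can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the text above: the matching cost is taken to be the Euclidean distance between meta-points predicted on the support image and the annotated support keypoints, coordinates are normalized to [0, 1], and the optimal one-to-one matching is found by brute force rather than by a dedicated bipartite-matching solver. The function name assign_meta_points is hypothetical.

```python
import itertools
import math


def assign_meta_points(meta_points, support_keypoints):
    """For each support keypoint, return the index of its assigned meta-point.

    Brute-force minimum-cost one-to-one matching (illustration only; it
    enumerates every candidate assignment, so it suits tiny inputs).
    Assumed cost: Euclidean distance between (x, y) coordinates.
    """
    n_meta, n_kpt = len(meta_points), len(support_keypoints)
    if n_kpt > n_meta:
        raise ValueError("need at least as many meta-points as support keypoints")

    best_cost, best_perm = math.inf, ()
    # perm[k] is the meta-point index proposed for support keypoint k.
    for perm in itertools.permutations(range(n_meta), n_kpt):
        cost = sum(
            math.dist(meta_points[m], support_keypoints[k])
            for k, m in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm)


# Four hypothetical meta-points, three annotated support keypoints.
meta_points = [(0.9, 0.1), (0.1, 0.1), (0.5, 0.8), (0.5, 0.5)]
support_keypoints = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
print(assign_meta_points(meta_points, support_keypoints))  # [1, 0, 2]
```

In this toy case the fourth meta-point stays unassigned, which mirrors the idea that there may be more potential keypoints than annotated ones. For realistic sizes a polynomial-time solver such as the Hungarian algorithm would replace the enumeration.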

Paper Structure

This paper contains 19 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of existing methods and our method. (a): Existing methods only rely on the local features at support keypoints to predict or refine the keypoints. (b): Our method employs learnable embeddings to capture inherent information and produces meta-points without support. Then our method assigns and refines meta-points according to support keypoints.
  • Figure 2: Our framework employs two stages to predict meta-points and desired keypoints. In the first stage, the learnable meta-embeddings interact with query feature maps via our progressive deformable point decoder to mine inherent information and predict meta-points and their visibilities. In the second stage, the meta-points are assigned identities according to the given support keypoints. After that, the assigned meta-points are refined to the desired keypoints, based on the support features and the mined inherent information, via another point decoder.
  • Figure 3: Illustration of a decoder layer in our Progressive Deformable Point Decoder. The input embeddings first interact with each other via a self-attention module. After that, offsets and weights are predicted to mine fine-grained features on the input feature maps. Finally, a deformable attention module uses the input points as references to refine the embeddings and points.
  • Figure 4: In each row, we show the GT keypoints and predictions for a pair of support and query images. The left two columns show the GT keypoints and estimated meta-points on the support image. The middle two columns show the GT keypoints and estimated meta-points on the query image. The right two columns show the final keypoints predicted by our method and by CapeFormer. Small black arrows indicate the deviations from GT. The radii of the drawn meta-points indicate their visibilities, and the assigned meta-points are circled in red.
  • Figure 5: In each row, we track one meta-embedding and visualize its meta-point on those samples where its visibility is greater than $0.5$.
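
The sampling step described in the Figure 3 caption (offsets and weights predicted around a reference point, then used to gather fine-grained features) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the decoder itself: the feature map is a single-channel 2D grid, coordinates are in pixels, and the offsets and weight logits are passed in directly, whereas in the actual decoder they would be predicted from the embeddings. The function names bilinear_sample and deformable_sample are hypothetical.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at pixel coords (x, y).

    Coordinates outside the grid are clamped to its border.
    """
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)  # floor, since x and y are non-negative here
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def deformable_sample(feature_map, ref_point, offsets, logits):
    """Weighted sum of features sampled at ref_point + each offset.

    The weights are a softmax over the given logits (one per offset).
    """
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # numerically stable softmax
    total = sum(exps)
    rx, ry = ref_point
    return sum(
        (e / total) * bilinear_sample(feature_map, rx + ox, ry + oy)
        for e, (ox, oy) in zip(exps, offsets)
    )


feature_map = [[0.0, 1.0],
               [2.0, 3.0]]
# Two sampling locations around the reference point (0, 0), equal weights.
value = deformable_sample(feature_map, (0.0, 0.0), [(0.0, 0.0), (1.0, 1.0)], [0.0, 0.0])
print(value)  # 1.5
```

A full deformable attention module would additionally use multiple heads, multi-channel (and often multi-scale) feature maps, and learned projections; the sketch keeps only the reference-point-plus-offset sampling idea.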