Edge Weight Prediction For Category-Agnostic Pose Estimation

Or Hirschorn, Shai Avidan

TL;DR

EdgeCape tackles category-agnostic pose estimation by learning weighted pose-graphs that refine a user-provided unweighted graph $A_{\text{prior}}$. The model refines structure-aware features via a dual-attention decoder and then predicts residual edges $\Delta A$, combining them with $A_{\text{prior}}$ through a learnable scaling to form a symmetric, row-normalized adjacency $\tilde{A}$. A self-supervised masking strategy provides adjacency supervision, while a Markov Attention Bias integrates graph-distance information into self-attention using the multi-hop relations $\tilde{A}, \tilde{A}^2, \dots$, enabling robust, global spatial reasoning. Evaluated on MP-100, EdgeCape achieves state-of-the-art results in the 1-shot setting and leads among similar-sized methods in the 5-shot setting, with gains at multiple thresholds and strong robustness to noisy graph priors. The approach offers practical improvements for CAPE by combining learned structural priors with efficient inference, and its publicly available code facilitates broader adoption.
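The adjacency refinement described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `refine_adjacency`, the scalar `scale` standing in for the learnable scaling, the clipping of negative weights, and the order of symmetrization and normalization are all hypothetical choices.

```python
import numpy as np

def refine_adjacency(a_prior, delta_a, scale):
    """Combine a prior graph with predicted residual edges (illustrative sketch).

    a_prior: (K, K) unweighted prior adjacency.
    delta_a: (K, K) predicted residual edge weights.
    scale:   scalar standing in for the learnable scaling (assumption).
    """
    a = a_prior + scale * delta_a           # residual update of the prior graph
    a = np.clip(a, 0.0, None)               # assumption: keep edge weights non-negative
    a = 0.5 * (a + a.T)                     # symmetrize
    row_sums = a.sum(axis=1, keepdims=True)
    return a / np.maximum(row_sums, 1e-8)   # row-normalize, guarding empty rows

# Usage: a 3-keypoint chain 0-1-2 with a predicted extra edge between 0 and 2.
a_prior = np.array([[0.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])
delta_a = np.zeros((3, 3))
delta_a[0, 2] = delta_a[2, 0] = 1.0
a_tilde = refine_adjacency(a_prior, delta_a, scale=0.5)
```

Note that row normalization after symmetrization does not, in general, keep the matrix exactly symmetric; how the paper reconciles the two properties is not specified in this summary.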

Abstract

Category-Agnostic Pose Estimation (CAPE) localizes keypoints across diverse object categories with a single model, using one or a few annotated support images. Recent works have shown that using a pose graph (i.e., treating keypoints as nodes in a graph rather than isolated points) helps handle occlusions and break symmetry. However, these methods assume a static pose graph with equal-weight edges, leading to suboptimal results. We introduce EdgeCape, a novel framework that overcomes these limitations by predicting the graph's edge weights, which optimizes localization. To further leverage structural priors, we propose integrating Markovian Structural Bias, which modulates the self-attention interaction between nodes based on the number of hops between them. We show that this improves the model's ability to capture global spatial dependencies. Evaluated on the MP-100 benchmark, which includes 100 categories and over 20K images, EdgeCape achieves state-of-the-art results in the 1-shot setting and leads among similar-sized methods in the 5-shot setting, significantly improving keypoint localization accuracy. Our code is publicly available.
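To make the hop-based attention bias more concrete, here is a minimal sketch assuming the bias is a weighted sum of powers of a row-normalized adjacency, added to the attention logits before the softmax. The function names, the per-hop weights `hop_weights`, and the additive form are assumptions for illustration; the exact formulation used by EdgeCape is not given in this summary.

```python
import numpy as np

def multi_hop_bias(a_tilde, hop_weights):
    """Weighted sum of adjacency powers: sum_k w_k * A^k (hypothetical form)."""
    power = np.eye(a_tilde.shape[0])
    bias = np.zeros(a_tilde.shape, dtype=float)
    for w in hop_weights:
        power = power @ a_tilde      # A^k: k-step transition weights
        bias += w * power
    return bias

def biased_attention(scores, bias):
    """Row-wise softmax over attention logits shifted by a structural bias."""
    logits = scores + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Usage: row-normalized chain 0-1-2; nodes 0 and 2 are two hops apart.
a_tilde = np.array([[0.0, 1.0, 0.0],
                    [0.5, 0.0, 0.5],
                    [0.0, 1.0, 0.0]])
bias = multi_hop_bias(a_tilde, hop_weights=[1.0, 1.0])
attn = biased_attention(np.zeros((3, 3)), bias)
```

In this toy chain, nodes 0 and 2 share no direct edge, yet the second power of the adjacency gives them a non-zero bias, which is how multi-hop terms let attention reach beyond immediate neighbors.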

Paper Structure

This paper contains 41 sections, 13 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Given a support image, keypoint definitions, and skeletal relations (support data) from any category, our model localizes the keypoints on a query image. Previous methods treat keypoints as isolated (CapeFormer [shi2023matching]) or use unweighted graphs (GraphCape [hirschorn2024]). We, in contrast, predict weighted graphs that lead to better localization.
  • Figure 2: Which of these pose-graphs is optimal? As edge placement can be ambiguous, we aim to learn the optimal pose-graph for category-agnostic keypoint localization. We show several graph annotations for the same image. The second-to-last (cyan) serves as input to our model; the last (orange) is our predicted pose-graph. While not visually superior, it achieves the best performance.
  • Figure 3: Framework Overview. Our model consists of two main components: a pose-graph predictor (visualized in Figure \ref{fig:skeleton_predict}) and a graph-based keypoint predictor. The pose-graph predictor refines the prior graph input by predicting residual connections. The graph-based keypoint predictor then utilizes the predicted keypoint relations, improving localization across diverse object structures.
  • Figure 4: Qualitative Comparison. We visualize keypoint predictions for the 1-shot setting. The left column shows the support data, followed by ground-truth query keypoints, and results of different methods. Our method performs best by leveraging predicted weighted pose-graphs, which serve as more effective structural priors for keypoint localization.
  • Figure 5: Predicted Pose-Graphs. We visualize predicted graphs: left column shows input $A_{\text{prior}}$, right shows output $A$. Line width reflects edge weight. Observe the slimmer table-base edges and the new facial edges. Our model prunes symmetric part links and forms connections that aid localization.
  • ...and 11 more figures