A Graph-Based Approach for Category-Agnostic Pose Estimation

Or Hirschorn, Shai Avidan

TL;DR

Traditional pose estimation models are restricted to predefined categories, limiting applicability to novel objects. This work introduces GraphCape, a graph-based approach that treats keypoints as a connected graph and uses a graph-transformer decoder to exploit geometric relations, enabling accurate pose localization for unseen categories with few support keypoints. Key contributions include (1) the GraphCape architecture with a Graph-FFN and a category-aware adjacency, (2) an updated MP-100 dataset with skeleton annotations for all categories, and (3) state-of-the-art performance in both 1-shot and 5-shot CAPE on MP-100, with improved robustness to occlusions and cross-category matching. The method advances CAPE by embedding structural priors into the decoding process, improving generalization to diverse objects and practical deployment in real-world, category-diverse scenes.

Abstract

Traditional 2D pose estimation models are limited by their category-specific design, making them suitable only for predefined object categories. This restriction becomes particularly challenging when dealing with novel objects due to the lack of relevant training data. To address this limitation, category-agnostic pose estimation (CAPE) was introduced. CAPE aims to enable keypoint localization for arbitrary object categories using a single few-shot model that requires only a few support images with annotated keypoints. We present a significant departure from conventional CAPE techniques, which treat keypoints as isolated entities, by treating the input pose data as a graph. We leverage the inherent geometrical relations between keypoints through a graph-based network to break symmetry, preserve structure, and better handle occlusions. We validate our approach on the MP-100 benchmark, a comprehensive dataset comprising over 20,000 images spanning over 100 categories. Our solution boosts performance by 0.98% under a 1-shot setting, achieving a new state of the art for CAPE. Additionally, we enhance the dataset with skeleton annotations. Our code and data are publicly available.
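
To make the idea of treating the pose as a graph concrete, the sketch below turns a skeleton annotation (a list of keypoint index pairs) into an adjacency matrix. It is only an illustration: the function name `normalized_adjacency`, the added self-loops, and the symmetric degree normalization are assumptions borrowed from common GCN practice, not details stated in this summary.

```python
import math

def normalized_adjacency(num_keypoints, edges):
    """Build a symmetric, degree-normalized adjacency matrix with self-loops.

    num_keypoints: number of keypoints K in the support annotation.
    edges: iterable of (i, j) index pairs taken from the skeleton.
    The normalization D^-1/2 (A + I) D^-1/2 is an assumption, not the paper's spec.
    """
    n = num_keypoints
    # Self-loops let each keypoint keep its own feature during aggregation.
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        adj[a][b] = 1.0
        adj[b][a] = 1.0  # skeleton links are treated as undirected
    deg = [sum(row) for row in adj]
    return [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 3-keypoint chain: 0 - 1 - 2
skeleton = [(0, 1), (1, 2)]
A = normalized_adjacency(3, skeleton)
```

Because the matrix is built from the support skeleton, it changes with the category of the support image, which is one way to read the "category-aware adjacency" mentioned in the TL;DR.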

Paper Structure (21 sections, 6 equations, 7 figures, 3 tables)

This paper contains 21 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 2: Architecture Overview. Our approach utilizes a pre-trained backbone to extract image features, followed by a transformer encoder that refines these features through self-attention. A similarity proposal generator is employed alongside a graph transformer decoder, enhancing keypoint localization accuracy with a focus on graph-oriented decoding.
  • Figure 3: Self-Attention Map Visualization. Comparing self-attention in decoders of three models: (a) CapeFormer-T trained on various object categories, (b) CapeFormer-T trained only on furniture objects, and (c) GraphCape trained on various object categories using a graph structure. Observe the edges between the legs and base of the chair. Notably, our graph-based method, despite being trained on multiple categories, exhibits similar attention patterns to models trained on single categories.
  • Figure 4: Graph FFN. The Transformer decoder is based on the original CapeFormer design, changing the feed-forward network from a simple MLP to a graph-based network. (a) A scheme of the transformer decoder which includes self-attention, cross-attention, and a feed-forward network. Self-attention encourages adaptive interactions among support keypoints, while cross-attention extracts localization information. (b) Previous FFN consisted of an MLP with 2 layers. (c) Our graph FFN includes a GCN layer and subsequent linear layers that enhance keypoint features and promote information exchange among known connected keypoints.
  • Figure 5: Qualitative Results. We visualize the keypoint predictions under a 1-shot setting. The left column denotes the support image with its corresponding skeleton. The second column is the ground-truth query keypoints. The following columns are results from POMNet, CapeFormer, CapeFormer-T, and our method.
  • Figure 6: Out-of-Distribution. Qualitative results on out-of-distribution (OOD) samples, with the support image on top and the query below. (a) Support and query are from the same OOD domain. (b) Support and query are from different domains.
  • ...and 2 more figures
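
The Figure 4 caption describes a graph FFN made of a GCN layer followed by linear layers. A minimal sketch of that idea, in plain Python with no deep-learning framework, might look as follows. The function names, the single output linear layer, the ReLU activation, and the toy weights are all assumptions for illustration; residual connections, normalization, and biases are omitted, and this is not the authors' implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]

def graph_ffn(features, adj, w_gcn, w_out):
    """Sketch of a graph feed-forward block.

    features: K x D keypoint features from the decoder's attention layers.
    adj:      K x K normalized adjacency built from the skeleton.
    w_gcn:    D x H weights of the GCN layer (hypothetical shapes).
    w_out:    H x D weights of the following linear layer.
    """
    # GCN step: each keypoint aggregates features from connected keypoints.
    aggregated = matmul(adj, features)
    hidden = relu(matmul(aggregated, w_gcn))
    # Linear step: project back to the feature dimension.
    return matmul(hidden, w_out)

# Toy example: 3 keypoints in a chain, 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0.5, 0.5, 0.0], [1 / 3, 1 / 3, 1 / 3], [0.0, 0.5, 0.5]]  # row-normalized, with self-loops
identity = [[1.0, 0.0], [0.0, 1.0]]
out = graph_ffn(features, adj, identity, identity)
```

With identity weights the output is just the neighbor-averaged features, which makes the contrast with a plain MLP visible: an MLP would transform each keypoint independently, whereas here unconnected keypoints (0 and 2 in the toy chain) do not exchange information directly.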