CapeX: Category-Agnostic Pose Estimation from Textual Point Explanation

Matan Rusanovsky; Or Hirschorn; Shai Avidan

TL;DR

CapeX tackles category-agnostic pose estimation by replacing support-image guidance with a text-based pose-graph whose nodes carry textual keypoint descriptions. The method fuses image features (from a SwinV2 backbone) with open-vocabulary text embeddings (from a frozen text backbone) in a three-block transformer and graph transformer decoder, and is trained with the losses $L_{heatmap}$ and $L_{offset}$. On MP-100, CapeX sets a new state of the art in the 1-shot setting with an average $PCK_{0.2}$ of $91.50$, without finetuning the text backbone, and is robust to text variations and moderate occlusions. The work also augments MP-100 with text annotations for keypoints, enabling richer open-vocabulary evaluation, and highlights remaining challenges in novel-category generalization and extreme occlusion.
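For readers unfamiliar with the metric, here is a minimal sketch of how a PCK score at threshold 0.2 is commonly computed. It assumes the usual convention that a keypoint counts as correct when its predicted location lies within `alpha` times the bounding-box size of the ground truth. The function name `pck`, its parameters, and the normalization by a single `bbox_size` are illustrative assumptions, not taken from the paper's code.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def pck(
    pred: Sequence[Point],
    gt: Sequence[Point],
    bbox_size: float,
    alpha: float = 0.2,
    visible: Optional[List[bool]] = None,
) -> float:
    """Fraction of visible keypoints predicted within alpha * bbox_size of the ground truth.

    Assumption: distances are normalized by one scalar bounding-box size
    (for example the longer side of the object box).
    """
    if visible is None:
        visible = [True] * len(gt)
    threshold = alpha * bbox_size
    counted = 0
    correct = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # keypoints without a ground-truth annotation are skipped
        counted += 1
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return correct / counted if counted else float("nan")


# Toy example with made-up coordinates: threshold is 0.2 * 100 = 20 pixels.
pred = [(10.0, 10.0), (50.0, 50.0), (90.0, 90.0)]
gt = [(12.0, 10.0), (50.0, 80.0), (90.0, 91.0)]
score = pck(pred, gt, bbox_size=100.0)
```

In this toy case two of the three predictions fall within 20 pixels of their targets, so the score is 2/3. A benchmark-level figure such as the one quoted above would be an average of per-image or per-category scores.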

Abstract

Conventional 2D pose estimation models are constrained by their design to specific object categories, which limits their applicability to predefined objects. Category-agnostic pose estimation (CAPE) emerged to overcome this limitation. CAPE aims to localize keypoints for diverse object categories with a single unified model that can generalize from minimal annotated support images. Recent CAPE works produce object poses based on arbitrary keypoint definitions annotated on a user-provided support image. Our work departs from these methods by replacing the support image with text. Specifically, we use a pose-graph whose nodes represent keypoints that are described with text. This representation takes advantage of the abstraction of text descriptions and the structure imposed by the graph. Our approach effectively breaks symmetry, preserves structure, and improves occlusion handling. We validate it on the MP-100 benchmark, a comprehensive dataset spanning over 100 categories and 18,000 images. Under a 1-shot setting, our solution achieves a notable performance boost of 1.07%, establishing a new state-of-the-art for CAPE. Additionally, we enrich the dataset with text description annotations, further enhancing its utility for future research.
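To make the pose-graph idea concrete, the sketch below shows one possible way to hold such an input: a list of keypoint text descriptions (the nodes) plus index pairs for the skeleton (the edges), from which an adjacency matrix can be built for a graph-based decoder. The class name `TextPoseGraph`, its fields, and the example keypoint names are hypothetical and are not the authors' data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextPoseGraph:
    """Hypothetical container: keypoints described by text, linked by skeleton edges."""

    keypoint_texts: List[str]        # one description per node
    edges: List[Tuple[int, int]]     # undirected links between node indices

    def adjacency(self) -> List[List[int]]:
        """Build a symmetric 0/1 adjacency matrix from the edge list."""
        n = len(self.keypoint_texts)
        adj = [[0] * n for _ in range(n)]
        for i, j in self.edges:
            if not (0 <= i < n and 0 <= j < n):
                raise ValueError(f"edge ({i}, {j}) refers to a missing keypoint")
            adj[i][j] = 1
            adj[j][i] = 1
        return adj


# Illustrative example: the text distinguishes left from right,
# which is the kind of symmetry a support image alone may leave ambiguous.
face = TextPoseGraph(
    keypoint_texts=["left eye", "right eye", "nose"],
    edges=[(0, 2), (1, 2)],
)
adj = face.adjacency()
```

In a full pipeline, each description would be embedded by a text backbone and the adjacency matrix would tell the graph decoder which keypoints are structurally connected; both steps are outside this sketch.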

Paper Structure (23 sections, 3 equations, 14 figures, 2 tables)

Introduction
Related Work
Category-Agnostic Pose Estimation
Open-Vocabulary Models
Method
Open-Vocabulary Keypoint Detection
Text Prompts as Visual Cues
Experiments
Implementation Details
Benchmark Results
Ablation Study
Text Modifications
Occlusions and Levels of Abstraction
Out of Distribution Query Images
Limitations
...and 8 more sections

Figures (14)

Figure 1: CapeX in action: Given text descriptions of the support keypoints (in pink) and a corresponding skeleton (not shown), our model localizes the skeleton on query images. The first row shows a few input support text descriptions; below each support input are a query image from the test set on the left (green) and an AI-generated query image on the right (blue). Our approach does not require a support image. Instead, it utilizes the abstraction power of text to improve keypoint localization.
Figure 2: Different Open-Vocabulary Tasks: We show three different open-vocabulary tasks: (a) object detection, (b) part segmentation, and (c) keypoint detection. Object detection identifies objects and locations, segmentation provides pixel-level semantic details, and keypoint detection offers finer localization than object detection while being more practical for localization than segmentation.
Figure 3: Architecture overview: Our framework uses image and text backbones, which provide accurate and abstract descriptions, respectively. The extracted feature descriptors are forwarded to the transformer encoder, which refines them. The refined features are passed to the proposal generator and the graph transformer decoder, which exploits the graph structure in the data.
Figure 4: Qualitative results: From left to right: support images that are used by the competitors, CapeFormer-S, Pose Anything-S, our model, and the GT. Support text descriptions used by our model are not shown. Main differences are pointed out using arrows.
Figure 5: Modified text descriptions: The top row shows the text descriptions of the support keypoints. Left is a synonym test, middle is a translation test, and right is a typo test. Below each description, the query output(s) are presented. Each node in the presented graph is the average of the positions predicted from the original and modified text descriptions. The diameter represents the distance between those positions.
...and 9 more figures
