X-Pose: Detecting Any Keypoints

Jie Yang, Ailing Zeng, Ruimao Zhang, Lei Zhang

TL;DR

X-Pose tackles open-world, multi-object keypoint detection with a fully end-to-end framework that uses multi-modal prompts (visual, textual, or both) to identify arbitrary keypoints across diverse object categories. It is trained on UniKPT, a unified collection of 13 datasets with 338 keypoints, 1,237 categories, and over 400K instances, which enables text-to-keypoint and image-to-keypoint alignment through cross-modality contrastive learning. The approach improves on state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods by 27.7 AP, 6.44 PCK, and 7.0 AP respectively, and generalizes in the wild across image styles, poses, and categories. The work contributes both a large-scale dataset and a prompt-guided detector that accepts textual and visual inputs for fine-grained, open-world perception tasks.

Abstract

This work aims to address an advanced keypoint detection problem: how to accurately detect any keypoints in complex real-world scenarios, which involve massive, messy, and open-ended objects as well as their associated keypoint definitions. Current high-performance keypoint detectors often fail to tackle this problem due to their two-stage schemes, under-explored prompt designs, and limited training data. To bridge the gap, we propose X-Pose, a novel end-to-end framework with multi-modal (i.e., visual, textual, or their combinations) prompts to detect multi-object keypoints for any articulated (e.g., human and animal), rigid, and soft objects within a given image. Moreover, we introduce a large-scale dataset called UniKPT, which unifies 13 keypoint detection datasets with 338 keypoints across 1,237 categories and over 400K instances. Trained on UniKPT, X-Pose effectively aligns text to keypoints and images to keypoints, owing to the mutual enhancement of multi-modal prompts based on cross-modality contrastive learning. Our experimental results demonstrate that X-Pose achieves notable improvements of 27.7 AP, 6.44 PCK, and 7.0 AP compared to state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods in each respective fair setting. More importantly, the in-the-wild test demonstrates X-Pose's strong fine-grained keypoint localization and generalization abilities across image styles, object categories, and poses, paving a new path to multi-object keypoint detection in real applications. Our code and dataset are available at https://github.com/IDEA-Research/X-Pose.
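Since the reported gains are stated partly in PCK (Percentage of Correct Keypoints), a minimal sketch of how such a score is commonly computed may help in reading the numbers. This is an illustration under stated assumptions, not the authors' evaluation code: the function name `pck`, the threshold `alpha=0.2`, and the choice of normalizing by a single object size are assumptions that do not come from the text above.

```python
from math import hypot


def pck(pred, gt, visible, norm_size, alpha=0.2):
    """Fraction of visible keypoints predicted within alpha * norm_size of ground truth.

    pred, gt:  lists of (x, y) pixel coordinates, one pair per keypoint.
    visible:   list of booleans; invisible keypoints are skipped.
    norm_size: normalizing length (assumed here: the object's bounding-box size).
    alpha:     tolerance as a fraction of norm_size (0.2 is an assumed default).
    """
    threshold = alpha * norm_size
    hits = 0
    total = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # unlabeled or occluded keypoints do not count
        total += 1
        if hypot(px - gx, py - gy) <= threshold:
            hits += 1
    # Return 0.0 when no keypoint is visible to avoid dividing by zero.
    return hits / total if total else 0.0


# Toy example with made-up coordinates (not data from the paper).
pred = [(10.0, 10.0), (52.0, 48.0), (90.0, 30.0), (0.0, 0.0)]
gt = [(12.0, 11.0), (50.0, 50.0), (60.0, 30.0), (5.0, 5.0)]
visible = [True, True, True, False]
score = pck(pred, gt, visible, norm_size=100.0, alpha=0.2)
print(f"PCK@0.2 = {score:.3f}")
```

In the toy example, two of the three visible keypoints fall within the 20-pixel tolerance, so the score is 2/3. AP for keypoints is a different, ranking-based metric and is not covered by this sketch.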

Paper Structure (20 sections, 1 equation, 6 figures, 11 tables)

Figures (6)

  • Figure 1: In-the-wild test of X-Pose for any keypoint detection. We highlight the powerful detection performance from cross-category (the first row), multi-object (the second row), and cross-image-style (the third row) with various pose scenarios.
  • Figure 2: The overview architecture of X-Pose. Given an input image, X-Pose follows the coarse-to-fine strategy to detect keypoints of any object via textual or visual prompts.
  • Figure 3: The detailed illustration of a) Visual Prompt Encoder, b) Cross-Modality Interactive Encoder, and c) Cross-Modality Interactive Decoder. In (b) and (c), blue modules are newly introduced to incorporate prompt interactions.
  • Figure 4: In-the-wild test of X-Pose for any face keypoint detection. We showcase the model's strong generalization to detect face keypoints of any object with 68 keypoint definitions, despite being trained only on the person's face with these definitions.
  • Figure 5: Visualization of the detected keypoints via X-Pose on UniKPT.
  • ...and 1 more figure