Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching

Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao, Kaipeng Zhang

TL;DR

This work introduces the Open-Vocabulary Keypoint Detection (OVKD) task, which uses text prompts to identify arbitrary keypoints across any species, and develops a framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM), which combines vision and language models.

Abstract

Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into fully supervised and few-shot class-agnostic approaches. The former typically relies on laborious and time-consuming manual annotations, posing considerable challenges in expanding keypoint detection to a broader range of keypoint categories and animal species. The latter, though less dependent on extensive manual input, still requires annotated support images for reference during testing. To realize zero-shot keypoint detection without any prior annotation, we introduce the Open-Vocabulary Keypoint Detection (OVKD) task, which is designed to use text prompts for identifying arbitrary keypoints across any species. In pursuit of this goal, we have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM). This framework combines vision and language models, creating an interplay between language features and local keypoint visual features. KDSM enhances its capabilities by integrating Domain Distribution Matrix Matching (DDMM) and other special modules, such as the Vision-Keypoint Relational Awareness (VKRA) module, improving the framework's generalizability and overall performance. Our comprehensive experiments demonstrate that KDSM significantly outperforms the baseline and achieves remarkable success in the OVKD task. Impressively, our method, operating in a zero-shot fashion, still yields results comparable to state-of-the-art few-shot species class-agnostic keypoint detection methods. We will make the source code publicly accessible.

Paper Structure (17 sections, 8 equations, 7 figures, 7 tables, 1 algorithm)

Figures (7)

  • Figure 1: Few-shot Species Class-Agnostic Keypoint Detection vs. Language-driven Open-Vocabulary Keypoint Detection. (a) Current few-shot species class-agnostic keypoint detection needs support images for guidance during training and testing to detect keypoints in new species. (b) Language-driven OVKD aims to use text prompts that embed both $\{\textit{animal species}\}$ and $\{\textit{keypoint category}\}$ as semantic guidance to localize arbitrary keypoints of any species.
  • Figure 2: An overview of the baseline method for OVKD. The baseline comprises a $\mathrm{Vision\_Encoder}$, a $\mathrm{Text\_Encoder}$, a $\mathrm{Vision\_Head}$ and a $\mathrm{Keypoint\_Adapter}$. The $\mathrm{Keypoint\_Adapter}$ is applied to optimize the relevance of the text features to the image features and produces a text feature of shape C$\times$K, where C and K represent the number of channels and the number of text prompts, respectively. The $\mathrm{Vision\_Head}$ produces a visual feature of shape C$\times$hei.$\times$wid., where hei. and wid. represent the height and width, respectively.
  • Figure 3: An overview of KDSM. KDSM comprises a $\mathrm{Vision\_Encoder}$, a $\mathrm{Text\_Encoder}$, a $\mathrm{Keypoint\_Adapter}$, a $\mathrm{Vision\_Adapter}$ and a $\mathrm{Vision\_Head}$ similar to the baseline. The vision-keypoint relational awareness module adjusts visual features according to their associations with keypoints. The $\mathrm{Vision\_Adapter}$ is employed to modify the feature shape so that it matches the text features' shape. Similarity is calculated between the adjusted features and text semantic features, resulting in a predicted distribution matrix. The predicted distribution matrix and the text domain distribution matrix are then utilized to compute matching loss.
  • Figure 4: Comparison of the trade-off between PCK@0.2 and Speed for Setting B. The speed is measured using Frames Per Second (FPS) on a single NVIDIA A100-SXM-80GB card. The test is conducted using an average of 1000 images for one species.
  • Figure 5: Comparisons of the performance (PCK@0.2) between the baseline and KDSM on long-tail species for the AnimalWeb dataset.
  • ...and 2 more figures
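
The Figure 2 caption gives two feature shapes: a text feature of shape C×K and a visual feature of shape C×hei.×wid. The following is a minimal sketch of one plausible way such features could be combined into per-prompt heatmaps. It assumes a plain channel-wise dot product and argmax decoding, neither of which is stated in the captions above. The function name `match_keypoints` and the toy inputs are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def match_keypoints(text_feat, vis_feat):
    """Correlate text features with visual features (illustrative only).

    text_feat: array of shape (C, K), one C-dim vector per text prompt.
    vis_feat:  array of shape (C, H, W), one C-dim vector per pixel.
    Returns heatmaps of shape (K, H, W) and integer (x, y) peaks of shape (K, 2).
    """
    c_text, num_prompts = text_feat.shape
    c_vis, height, width = vis_feat.shape
    if c_text != c_vis:
        raise ValueError("text and visual features must share the channel dim C")

    # Assumption: similarity is a dot product over the channel axis.
    heatmaps = np.einsum("ck,chw->khw", text_feat, vis_feat)

    # Assumption: each keypoint is decoded as the location of the heatmap peak.
    flat_idx = heatmaps.reshape(num_prompts, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (height, width))
    coords = np.stack([xs, ys], axis=1)
    return heatmaps, coords


# Toy example: C=4 channels, K=2 prompts, a 3x5 feature map.
text_feat = np.eye(4)[:, :2]             # two orthogonal prompt vectors
vis_feat = np.zeros((4, 3, 5))
vis_feat[:, 1, 2] = text_feat[:, 0]      # prompt 0 matches pixel (y=1, x=2)
vis_feat[:, 2, 4] = text_feat[:, 1]      # prompt 1 matches pixel (y=2, x=4)

heatmaps, coords = match_keypoints(text_feat, vis_feat)
print(heatmaps.shape)   # (2, 3, 5)
print(coords.tolist())  # [[2, 1], [4, 2]]
```

In this sketch each of the K text prompts yields its own hei.×wid. response map, so adding a new keypoint category only requires adding a column to the text feature. The KDSM pipeline in Figure 3 differs from this: it reshapes visual features with the $\mathrm{Vision\_Adapter}$ and compares a predicted distribution matrix against a text domain distribution matrix, which this example does not attempt to reproduce.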