OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection

Changsheng Lu, Zheyuan Liu, Piotr Koniusz

TL;DR

This work tackles robust zero- and few-shot keypoint detection under diverse prompts by opening prompt diversity across modality, semantics, and language. It introduces OpenKD, a multimodal prototype-based framework that fuses visual and textual prompts through shared keypoint prototypes and uses LLM-driven text interpolation and parsing to generalize to unseen keypoints. Training combines auxiliary keypoints and texts, intra- and inter-modality contrastive learning, and a heatmap-based decoder. An LLM parses diverse prompts and interpolates auxiliary texts via chain-of-thought prompting, with false-text control to filter unreliable interpolations. Empirically, OpenKD delivers state-of-the-art results on Z-FSKD across multiple datasets, with pronounced gains for novel keypoints, and handles diverse language prompts robustly, aided by over 96% parsing accuracy from GPT-3.5.

Abstract

Exploiting foundation models (e.g., CLIP) to build a versatile keypoint detector has gained increasing attention. Most existing models accept either a text prompt (e.g., "the nose of a cat") or a visual prompt (e.g., a support image with keypoint annotations) to detect the corresponding keypoints in a query image, thereby exhibiting either zero-shot or few-shot detection ability. However, research on multimodal prompting is still underexplored, and prompt diversity in semantics and language is far from opened. For example, how should a model handle unseen text prompts for novel keypoint detection, or diverse text prompts like "Can you detect the nose and ears of a cat?" In this work, we open prompt diversity from three aspects: modality, semantics (seen vs. unseen), and language, to enable more generalized zero- and few-shot keypoint detection (Z-FSKD). We propose a novel OpenKD model which leverages a multimodal prototype set to support both visual and textual prompting. Further, to infer the keypoint locations of unseen texts, we add auxiliary keypoints and texts interpolated from the visual and textual domains into training, which improves the spatial reasoning of our model and significantly enhances zero-shot novel keypoint detection. We also found that a large language model (LLM) is a good parser, achieving over 96% accuracy in parsing keypoints from texts. With an LLM, OpenKD can handle diverse text prompts. Experimental results show that our method achieves state-of-the-art performance on Z-FSKD and initiates new ways to deal with unseen and diverse texts. The source code and data are available at https://github.com/AlanLuSun/OpenKD.

Paper Structure (14 sections, 9 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Illustration of multimodal prompting for keypoint detection. Our model detects keypoints given visual prompts formed by support images and keypoints (a), text prompts (b), or both (c). Panel (d) shows that our model combines the advantages of the two modalities, mitigating the weaknesses of either one alone.
  • Figure 2: Examples of keypoint detection under diverse text prompting. With an LLM, our method can deal with diverse texts, showing potential for real-world applications. The circles and crosses refer to predictions and GT, respectively. A keypoint is regarded as a correct detection if it falls within the white shaded area, which marks the PCK@0.1 threshold.
  • Figure 3: Sketch of model inference. OpenKD allows testing under a visual prompt, a text prompt, or both. For clarity, we show the "both" case (i.e., 1-shot testing with text). We first extract deep features of the texts, support images, and query images via CLIP, and then adapt the features of both modalities via residual refinement. After extracting the visual keypoint prototype (VKP) and its textual counterpart, we build the prototype set to perform class-agnostic correlation and heatmap decoding. Finally, we fuse the heatmaps induced by the two modalities (i.e., M1 & M2) to obtain predictions.
  • Figure 4: Model training and text interpolation. (a) In addition to multi-group heatmap regression, we improve model performance by introducing intra- and inter-modality contrastive learning and the novel auxiliary keypoint and text learning. (b) We exploit an LLM for auxiliary text interpolation and explore incorporating visual keypoint features for text selection to mitigate noisy or false texts.
  • Figure 5: Different clustering effects of two modalities on base keypoints in unseen species (a) and statistical feature variance per keypoint (b).
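
The inference flow in the Figure 3 caption (prototype, correlation, heatmap, fusion) and the PCK@0.1 criterion in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: random arrays stand in for CLIP features, plain cosine similarity replaces the learned class-agnostic correlation and heatmap decoder, the two heatmaps are fused by simple averaging, and PCK is normalized by the longer side of the object bounding box. All function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def correlation_heatmap(query_feat, prototype):
    """Cosine similarity between one prototype (C,) and every location
    of a query feature map (H, W, C). Stand-in for the learned
    correlation and heatmap decoder (assumption)."""
    return l2_normalize(query_feat) @ l2_normalize(prototype)  # (H, W)

def fuse_and_locate(heatmap_visual, heatmap_text):
    """Average the two modality heatmaps (assumed fusion rule) and
    return the fused map with its peak location as (x, y)."""
    fused = 0.5 * (heatmap_visual + heatmap_text)
    row, col = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, (int(col), int(row))

def is_correct_pck(pred_xy, gt_xy, bbox_wh, alpha=0.1):
    """PCK@alpha: correct if the prediction lies within alpha times the
    longer bounding-box side of the ground truth (assumed normalization)."""
    dist = np.hypot(pred_xy[0] - gt_xy[0], pred_xy[1] - gt_xy[1])
    return bool(dist <= alpha * max(bbox_wh))

# Toy data: random features stand in for adapted CLIP features.
rng = np.random.default_rng(0)
query_feat = rng.normal(size=(16, 16, 32))   # (H, W, C)
gt_xy = (9, 5)                               # ground-truth keypoint (x, y)

# Hypothetical prototypes: noisy copies of the feature at the true location.
target = query_feat[gt_xy[1], gt_xy[0]]
visual_proto = target + 0.1 * rng.normal(size=32)
text_proto = target + 0.1 * rng.normal(size=32)

heatmap_visual = correlation_heatmap(query_feat, visual_proto)
heatmap_text = correlation_heatmap(query_feat, text_proto)
fused, pred_xy = fuse_and_locate(heatmap_visual, heatmap_text)
correct = is_correct_pck(pred_xy, gt_xy, bbox_wh=(16, 16))
```

Testing with only one modality, as the caption also allows, would correspond here to locating the peak of a single heatmap instead of the fused one.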