ZeroKey: Point-Level Reasoning and Zero-Shot 3D Keypoint Detection from Large Language Models

Bingchen Gong, Diego Gomez, Abdullah Hamdi, Abdelrahman Eldesokey, Ahmed Abdelreheem, Peter Wonka, Maks Ovsjanikov

TL;DR

The paper tackles zero-shot 3D keypoint detection by leveraging Molmo’s pixel-level reasoning to extract and name salient 3D keypoints without any ground-truth annotations. It introduces ZeroKey, a pipeline that renders multiple views, prompts an MLLM to identify 2D keypoints, back-projects them into 3D, and stabilizes results through patch-based refinement and HDBSCAN clustering across views. Evaluations on KeypointNet show the approach achieving competitive IoU with supervised and few-shot methods while outperforming strong vision-language baselines, demonstrating the potential of integrating language models for localized 3D understanding. The work further demonstrates the utility of the approach via Schelling-point analysis and point describability studies, highlighting how language-enabled spatial reasoning can guide robust 3D shape understanding and manipulation.
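The back-projection step in this pipeline can be illustrated with a minimal sketch. It assumes that the renderer supplies a z-depth value at the pixel the MLLM points to, together with pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) and a camera-to-world pose. None of these details are given in the summary, and the function names `backproject_pixel` and `camera_to_world` are hypothetical. The authors' implementation may instead cast rays against the mesh.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with z-depth `depth` into camera coordinates.

    Assumes an ideal pinhole camera; (cx, cy) is the principal point.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Vec3:
    """Apply p_world = R @ p_cam + t, with R given as three rows."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical values: a 640x480 render with focal length 500 px.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point_cam = backproject_pixel(420.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0)
point_world = camera_to_world(point_cam, identity, [1.0, 0.0, 0.0])
```

Each view yields one such 3D candidate per 2D keypoint. Because candidates from different views rarely coincide exactly, the pipeline needs the cross-view clustering described in the summary.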

Abstract

We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to supervised methods, despite not requiring any 3D keypoint annotations during training. Our results highlight the potential of integrating language models for localized 3D shape understanding. This work opens new avenues for cross-modal learning and underscores the effectiveness of MLLMs in contributing to 3D computer vision challenges.

Paper Structure

This paper contains 24 sections, 7 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Zero-shot 3D Keypoint Detection. Without any ground truth labels or supervised training, our method leverages the point-level reasoning embedded within MLLMs to extract and name salient keypoints on 3D models. The figure illustrates how our approach achieves competitive performance compared to CLIP-DINOiser [wysoczanska2024clip] baselines, highlighting the potential of integrating language models with vision tasks for enhanced 3D shape understanding.
  • Figure 2: ZeroKey Pipeline. Our proposed ZeroKey employs MLLM Molmo for zero-shot keypoint detection on 3D objects by 1) rendering multiple views for a given shape, 2) leveraging MLLM reasoning in each view using point-specific prompts, and 3) aggregating the results through clustering, eliminating the need for annotated training data for 3D keypoint detection.
  • Figure 3: Ground-truth KeypointNet annotations (in red) compared with our method's predictions (in blue). The close alignment of the red and blue dots illustrates how accurately our approach localizes keypoints through point-level reasoning.
  • Figure 4: The number of rendered views versus the number of keypoints detected after aggregation. This figure shows how varying the number of rendered views affects the total number of keypoints detected by ZeroKey. The prompt here is "corner of the table". As the number of views increases, ZeroKey detects more keypoints that match the prompt.
  • Figure 5: We ask Molmo to describe the green point and use that description as the prompt from which ZeroKey predicts the blue point. Salient points, taken from the Schelling Points paper [chen2012schelling], are more easily describable and more consistently retrievable than arbitrary points. For some arbitrary points, ZeroKey finds no suitable point at all.
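
The aggregation step described in the Figure 2 and Figure 4 captions can be sketched as follows. This is not HDBSCAN, which the TL;DR names as the clustering method. It is a much simpler fixed-radius grouping used as a stand-in to show the idea: 3D candidates from different views that land close together are merged into one keypoint, and candidates supported by too few views are discarded as noise. The function name `aggregate_keypoints` and the parameters `radius` and `min_support` are assumptions made for illustration.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def aggregate_keypoints(candidates: Sequence[Point],
                        radius: float,
                        min_support: int) -> List[Point]:
    """Merge nearby per-view 3D candidates and return group centroids.

    Candidates closer than `radius` are linked (single linkage via
    union-find); groups with fewer than `min_support` members are dropped.
    """
    parent = list(range(len(candidates)))

    def find(i: int) -> int:
        # Path-halving lookup of the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if dist(candidates[i], candidates[j]) <= radius:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Point]] = {}
    for idx, point in enumerate(candidates):
        groups.setdefault(find(idx), []).append(point)

    keypoints: List[Point] = []
    for members in groups.values():
        if len(members) >= min_support:
            n = len(members)
            keypoints.append(tuple(sum(axis) / n for axis in zip(*members)))
    return keypoints


# Hypothetical candidates: two well-supported corners and one stray point.
candidates = [
    (0.00, 0.00, 0.0), (0.02, 0.00, 0.0), (0.01, 0.01, 0.0),
    (1.00, 1.00, 1.0), (1.01, 1.00, 1.0),
    (5.00, 5.00, 5.0),
]
corners = aggregate_keypoints(candidates, radius=0.05, min_support=2)
```

Under this sketch, adding views adds candidates, so more groups can reach `min_support`. That is consistent with the trend reported for Figure 4, although a density-based method such as HDBSCAN chooses cluster boundaries adaptively rather than through a single fixed radius.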