Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping

Sang Min Kim, Hyeongjun Heo, Junho Kim, Yonghyeon Lee, Young Min Kim

TL;DR

This work proposes Point2Act, which directly retrieves the 3D action point relevant to a contextually described task, leveraging Multimodal Large Language Models (MLLMs), and generates spatially grounded responses in under 20 seconds, facilitating practical manipulation tasks.

Abstract

We propose Point2Act, which directly retrieves the 3D action point relevant to a contextually described task by leveraging Multimodal Large Language Models (MLLMs). Foundation models have opened the possibility of generalist robots that can perform zero-shot tasks from natural language descriptions in unseen environments. While the semantics obtained from large-scale image and language datasets provide contextual understanding in 2D images, the rich yet nuanced features yield blurry 2D regions and struggle to pinpoint precise 3D locations for actions. Our proposed 3D relevancy fields bypass the high-dimensional features and instead efficiently distill lightweight 2D point-level guidance tailored to the task-specific action. Multi-view aggregation compensates for misalignments caused by geometric ambiguities, such as occlusion, or by semantic uncertainties inherent in the language descriptions. The output region is highly localized and captures fine-grained 3D spatial context, so it transfers directly to an explicit position for physical action in the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in under 20 seconds, facilitating practical manipulation tasks. Project page: https://sangminkim-99.github.io/point2act/

Paper Structure

This paper contains 18 sections, 3 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: We present Point2Act, which grounds natural language into localized 3D fields by distilling Multimodal LLMs, bridging semantic understanding and physical interaction in robotic tasks.
  • Figure 2: Overview of the Point2Act pipeline. We first capture posed images and query the MLLM [deitke2024molmo] with a prompt to predict 2D point annotations on the images. The multi-view predictions are distilled into a 3D relevancy field. AnyGrasp [fang2023anygrasp] proposes grasp candidates, and the most relevant grasp is selected based on the field. Subsampled grasp poses are visualized.
  • Figure 3: System diagram of Point2Act. Point2Act achieves a 59% speed-up compared to the sequential design.
  • Figure 4: Grasping performance overview. Hatched areas indicate different failure modes. Subplot (a) compares against RGB baselines; subplot (b) compares against RGB-D baselines and our depth variant. Abbreviations: LERF* (LERF-TOGO [rashid2023language]), GG (GaussianGrasper [zheng2024gaussiangrasper]), GS (GraspSplats [ji2024graspsplats]), MLLM* (MLLM 2D points with depth unprojection), and GM (GraspMolmo [deshpande2025graspmolmo]).
  • Figure 5: Effectiveness of Multi-view 3D Distillation. The scene contains several markers; the marker in the mug sits in the upper-left corner and is sometimes occluded by a tissue. Left: the MLLM's single-view 2D point prediction. Right: the 3D point projected by Point2Act. Single-view MLLM point predictions are often noisy and fail under occlusion. In contrast, Point2Act robustly localizes the relevant 3D point by aggregating multi-view cues.
  • ...and 4 more figures
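
The captions for Figures 2 and 5 describe two steps that a small example can make concrete: fusing noisy per-view 2D point predictions into one 3D location, and picking the grasp candidate closest to it. The sketch below is not the paper's method. Point2Act distills the predictions into a learned 3D relevancy field, whereas this stand-in uses trimmed least-squares ray triangulation and a Gaussian relevancy score. All function names, the `keep_fraction` and `sigma` parameters, and the toy data are hypothetical, and the grasp candidates are plain 3D centers rather than AnyGrasp outputs.

```python
import numpy as np


def triangulate_rays(origins, directions):
    """Return the 3D point minimizing the summed squared distance to all rays."""
    origins = np.asarray(origins, dtype=float)
    d = np.asarray(directions, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    # I - d d^T projects onto the plane orthogonal to each ray direction.
    proj = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]
    A = proj.sum(axis=0)
    b = np.einsum("nij,nj->i", proj, origins)
    return np.linalg.solve(A, b)


def ray_distances(point, origins, directions):
    """Perpendicular distance from `point` to each ray's supporting line."""
    origins = np.asarray(origins, dtype=float)
    d = np.asarray(directions, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    diff = point - origins
    along = np.sum(diff * d, axis=1, keepdims=True) * d
    return np.linalg.norm(diff - along, axis=1)


def aggregate_views(origins, directions, keep_fraction=0.8):
    """Trimmed fit: triangulate, drop the worst-fitting rays, triangulate again."""
    origins = np.asarray(origins, dtype=float)
    directions = np.asarray(directions, dtype=float)
    first = triangulate_rays(origins, directions)
    dist = ray_distances(first, origins, directions)
    n_keep = max(2, int(np.ceil(keep_fraction * len(dist))))
    keep = np.argsort(dist)[:n_keep]
    return triangulate_rays(origins[keep], directions[keep])


def select_grasp(grasp_centers, action_point, sigma=0.03):
    """Score candidates with a Gaussian stand-in for relevancy; return the best index."""
    centers = np.asarray(grasp_centers, dtype=float)
    sq_dist = np.sum((centers - action_point) ** 2, axis=1)
    relevancy = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return int(np.argmax(relevancy)), relevancy


# Toy scene (assumed data): five cameras whose predicted pixel rays pass through
# the true action point, plus one view whose prediction is off (e.g. occlusion).
target = np.array([0.1, 0.2, 0.5])
offsets = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1]], dtype=float)
origins = np.vstack([target + offsets, target + np.array([0.3, 0.0, 1.0])])
directions = np.vstack([-offsets, np.array([0.0, 0.0, -1.0])])

action_point = aggregate_views(origins, directions)
grasp_centers = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.52], [1.0, 1.0, 1.0]])
best_index, relevancy = select_grasp(grasp_centers, action_point)
```

In this toy setup the plain least-squares estimate is pulled toward the inconsistent ray, while the trimmed refit discards it and recovers the point the remaining views agree on. A real pipeline would obtain the rays by unprojecting predicted pixels through calibrated camera poses, which is outside the scope of this sketch.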