Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description

Mahmoud Ahmed; Junjie Fei; Jian Ding; Eslam Mohamed Bakr; Mohamed Elhoseiny

TL;DR

The paper defines PaPGD, a fine-grained 3D vision-language grounding task, and introduces 3DCoMPaT-GrIn, a large dataset with part and material annotations to support part-level grounding and description. It proposes Kestrel, a four-component 3D multimodal LLM with a query refinement mechanism that jointly learns language generation and precise point-wise segmentation masks. Through extensive experiments on Part-Aware Grounded Description, direct segmentation, and reasoning segmentation, Kestrel achieves state-of-the-art part grounding and high GPT-based 3D composition-aware language comprehension (3D-CALC). The work also demonstrates strong generalization to out-of-domain data and real-world noisy inputs, establishing a robust benchmark for part-aware 3D vision-language understanding with robotics relevance.

Abstract

In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. This dataset encompasses extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we propose Kestrel, a part-aware 3D multimodal large language model that integrates an advanced language model for nuanced language comprehension with multi-level point feature propagation and a query refinement mechanism to enhance spatial reasoning at the part level. Extensive experiments demonstrate that Kestrel effectively bridges the gap between part-aware language understanding and 3D segmentation grounding, paving the way for more robust and interpretable 3D object comprehension that meets the demands of real-world robotic applications. Project page: https://feielysia.github.io/Kestrel.github.io/

Paper Structure (25 sections, 11 equations, 9 figures, 14 tables)

Introduction
Related Work
3DCoMPaT-GrIn
Method
Kestrel
Training Objective
Experiments
Part-Aware Point Grounded Description
Single-Part Segmentation Grounding
Ablation Studies
Application
Conclusion
Acknowledgements
Mask3D Baseline
Additional Ablation Studies
...and 10 more sections

Figures (9)

Figure 1: Part-Aware Point Grounded Description. Given an input point cloud, the model is tasked with predicting a grounded description - text that provides a detailed interpretation of the 3D object. Each part-level phrase in this generated text (e.g., “back-rest” and “seat support”) is linked to a point-wise segmentation mask, challenging the model’s capability for part-aware language understanding and segmentation grounding (it is worth noting that the colors shown in this figure are not the actual colors of the point cloud but are used to represent the different segmentation masks).
Figure 2: Kestrel: A Part-Aware Point Grounding MLLM. The Kestrel model incorporates a point encoder and an LLM to construct a 3D MLLM, designed to generate detailed descriptions based on the input point cloud and text. The 3D Segmentation Decoder extracts the output embedding of the [SEG] token from the output hidden states of the 3D MLLM. After projecting these [SEG] embeddings, the 3D SGM uses them as initial queries $\mathbf{q}_0$. The point feature propagation module (PFPM) encodes multi-level point features $\mathbf{f}_{p}$. Then, the segmentation decoder takes $\mathbf{q}_0$ and $\mathbf{f}_{p}$ as input to generate the point-wise segmentation masks using a query refinement mechanism.
Figure 3: Qualitative results of Kestrel on Part-Aware Point Grounded Description, Reasoning and Direct Segmentation. The results show that Kestrel is capable of detailed 3D object understanding, providing comprehensive description and accurate part-level grounding.
Figure 4: Out-of-Domain Generalization. Kestrel remains robust under the domain shift from 3DCoMPaT-GrIn to Objaverse, as well as under the input distribution shift from 3D single-object training to 3D multi-object testing.
Figure 5: Real-World Demos. Kestrel shows a certain degree of robustness to noisy and incomplete real-world inputs.
...and 4 more figures
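
The Figure 2 caption says the segmentation decoder takes the initial queries $\mathbf{q}_0$ and the point features $\mathbf{f}_{p}$ and produces point-wise masks, but it does not say how a query is turned into a mask. The minimal sketch below assumes a dot product between each query and each per-point feature followed by a sigmoid and a threshold, a common choice in query-based segmentation decoders. This is an illustrative assumption, not the paper's confirmed design, and the query refinement step is omitted. The function name `predict_masks`, the `threshold` parameter, and the toy values are all hypothetical.

```python
import math


def predict_masks(queries, point_features, threshold=0.5):
    """Return one boolean per-point mask for each query.

    queries: list of Q vectors of length D, standing in for the projected
        [SEG] embeddings q_0 (hypothetical toy values here).
    point_features: list of N vectors of length D, standing in for the
        per-point features f_p.
    Assumption: mask logits are plain dot products; the paper may differ.
    """
    masks = []
    for q in queries:
        mask = []
        for f in point_features:
            # Similarity between this query and this point's feature.
            logit = sum(qi * fi for qi, fi in zip(q, f))
            # Sigmoid maps the logit to a per-point probability.
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask.append(prob > threshold)
        masks.append(mask)
    return masks


# Two toy queries (e.g. two part phrases) over three points in a 2-D feature space.
queries = [[1.0, 0.0], [0.0, 1.0]]
point_features = [[2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]]
masks = predict_masks(queries, point_features)
```

Each query yields its own mask, which mirrors the caption's description of one segmentation mask per [SEG] token. A point may belong to more than one mask under this scheme, since each query is thresholded independently.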
