MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation

Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu

TL;DR

This work addresses 3D reasoning segmentation without 3D annotations by transferring reasoning capabilities from 2D multimodal LLMs to 3D via multi-view pseudo-labels. A frozen 2D MLLM and SAM generate per-view masks and text embeddings, which are unprojected to 3D and fused with an attention-based, cross-view alignment scheme. A Token-for-Query mechanism and semantic/spatial losses bind the same object across views, enabling robust implicit-intent understanding and spatial reasoning in 3D. The approach achieves state-of-the-art performance in label-free settings on indoor benchmarks and demonstrates strong generalization, with notable improvements over specialized 3D models on several metrics, while highlighting trade-offs in computational cost due to multiple 2D inferences.

Abstract

Reasoning segmentation aims to segment target objects in complex scenes based on human intent and spatial reasoning. While recent multimodal large language models (MLLMs) have demonstrated impressive 2D image reasoning segmentation, adapting these capabilities to 3D scenes remains underexplored. In this paper, we introduce MLLM-For3D, a simple yet effective framework that transfers knowledge from 2D MLLMs to 3D scene understanding. Specifically, we utilize MLLMs to generate multi-view pseudo segmentation masks and corresponding text embeddings, then unproject 2D masks into 3D space and align them with the text embeddings. The primary challenge lies in the absence of 3D context and spatial consistency across multiple views, causing the model to hallucinate objects that do not exist and fail to target objects consistently. Training the 3D model with such irrelevant objects leads to performance degradation. To address this, we introduce a spatial consistency strategy to enforce that segmentation masks remain coherent in the 3D space, effectively capturing the geometry of the scene. Moreover, we develop a Token-for-Query approach for multimodal semantic alignment, enabling consistent identification of the same object across different views. Extensive evaluations on various challenging indoor scene benchmarks demonstrate that, even without any labeled 3D training data, MLLM-For3D outperforms existing 3D reasoning segmentation methods, effectively interpreting user intent, understanding 3D scenes, and reasoning about spatial relationships.
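
The unprojection step mentioned above can be illustrated with a minimal sketch. The abstract does not give the camera model or data layout, so everything here is an assumption made for illustration: a pinhole camera with intrinsics `K`, a per-view depth map in metric units, a 4x4 camera-to-world pose, and the hypothetical function name `unproject_mask`.

```python
import numpy as np

def unproject_mask(mask, depth, K, cam_to_world):
    """Lift the masked pixels of one view to world-space 3D points.

    Assumed inputs (not specified in the paper summary):
      mask         (H, W) boolean 2D pseudo-mask for this view
      depth        (H, W) depth along the camera z-axis, 0 where invalid
      K            (3, 3) pinhole intrinsics
      cam_to_world (4, 4) camera-to-world pose
    """
    valid = mask.astype(bool) & (depth > 0)
    v, u = np.nonzero(valid)                 # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]          # back-project with the pinhole model
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous, (N, 4)
    return (pts_cam @ cam_to_world.T)[:, :3]                # world frame, (N, 3)
```

Points lifted from several views could then be matched to the scene point cloud (for example by nearest-neighbor lookup) to obtain per-point pseudo-labels. How the paper performs that association is not stated in this summary.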

Paper Structure

This paper contains 21 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: For the same scene, we present three different queries and display the 2D reasoning segmentation results on the same set of frames, illustrating how the model responds to varying instructions.
  • Figure 2: Overview of the proposed MLLM-For3D framework. We adapt multimodal large language models (MLLMs) for 3D reasoning segmentation by generating multi-view pseudo-labels and filtering irrelevant views via token attention. During training, we enforce cross-view consistency via a spatial consistency strategy and align a unified embedding $\mathbf{q}$ with the per-point 3D feature $\mathbf{f}_p^{\text{3D}}$ via a multimodal semantic loss, enabling consistent object identity binding across views.
  • Figure 3: Visual comparison of MLLM-For3D with the previous state-of-the-art method Reason3D on the Instruct3D dataset. For each row, we show the ground-truth rendered scene (left), the baseline’s prediction, our result, and the textual query. Our method accurately interprets implicit user instructions and produces coherent 3D masks.
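
The alignment between the unified embedding $\mathbf{q}$ and the per-point features $\mathbf{f}_p^{\text{3D}}$ described in the Figure 2 caption can be sketched as follows. The exact form of the multimodal semantic loss is not given in this summary. The version below (cosine similarity followed by binary cross-entropy against per-point pseudo-labels) is an assumed stand-in, and the function name and `temperature` parameter are hypothetical.

```python
import numpy as np

def semantic_alignment_loss(q, point_feats, pseudo_labels, temperature=0.1):
    """Assumed stand-in for a query-to-point alignment loss.

    q             (D,)   unified query embedding
    point_feats   (N, D) per-point 3D features
    pseudo_labels (N,)   1 if the point lies inside the unprojected mask, else 0
    """
    q = q / np.linalg.norm(q)
    f = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    logits = (f @ q) / temperature           # scaled cosine similarity per point
    probs = 1.0 / (1.0 + np.exp(-logits))    # per-point foreground probability
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7) # avoid log(0)
    y = pseudo_labels.astype(float)
    loss = -np.mean(y * np.log(probs) + (1.0 - y) * np.log(1.0 - probs))
    return loss, probs
```

Thresholding `probs` at inference time would give a 3D mask for the query. That step is likewise an illustrative choice rather than the authors' stated procedure.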