XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Ziyi Wang, Yanbo Wang, Xumin Yu, Jie Zhou, Jiwen Lu

TL;DR

A mask generator based on the denoising UNet from a pre-trained diffusion model is developed, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks.

Abstract

Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we develop a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.
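
The mask-level alignment described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each 2D mask has already been mapped to a binary selection over 3D points, that features are pooled by simple averaging, and that a cosine-distance loss is used. The function names (`mask_pool`, `mask_alignment_loss`) and the toy data are hypothetical.

```python
import math

def mask_pool(point_feats, mask):
    """Average the features of the points selected by a binary mask."""
    selected = [f for f, m in zip(point_feats, mask) if m]
    if not selected:
        raise ValueError("mask selects no points")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mask_alignment_loss(point_feats, masks, mask_embeds):
    """Mean cosine distance between pooled 3D features and 2D mask embeddings.

    Assumption: one target embedding per mask, living in the same
    vision-language feature space as the 3D features.
    """
    losses = []
    for mask, target in zip(masks, mask_embeds):
        pooled = mask_pool(point_feats, mask)
        losses.append(1.0 - cosine_similarity(pooled, target))
    return sum(losses) / len(losses)

# Toy data (illustrative only): four points with 2-dimensional features.
point_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]      # two masks over the four points
aligned_embeds = [[2.0, 0.0], [0.0, 3.0]]  # point in the same directions
swapped_embeds = [[0.0, 3.0], [2.0, 0.0]]  # orthogonal to the pooled features

loss_aligned = mask_alignment_loss(point_feats, masks, aligned_embeds)
loss_swapped = mask_alignment_loss(point_feats, masks, swapped_embeds)
```

In this toy setup the loss is lowest when each pooled 3D feature points in the same direction as its mask embedding, which is the behavior a mask-level regularizer of this kind is meant to encourage.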

Paper Structure

This paper contains 26 sections, 15 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The overall framework of XMask3D. The 3D model with only coarse 3D-2D-text alignment struggles to segment novel categories with accurate boundaries. We propose to incorporate a 2D open mask generator conditioned on global 3D geometry features to create geometry-aware segmentation masks of novel categories. Then we apply fine-grained mask-level regularization on 3D features, thereby enhancing the dense open vocabulary capability of the 3D model. The cross-modal fusion block leverages the strengths of both branches to achieve optimal results.
  • Figure 2: The detailed architecture of XMask3D. We introduce an auxiliary 2D branch, which utilizes global point cloud features as conditional input to generate open vocabulary masks. The contour of the mask is utilized for regularization at the mask level on 3D features, and the embeddings of the mask are fused with the 3D features to enhance cross-modal complementarity.
  • Figure 2: Open-vocabulary 3D semantic segmentation results on the S3DIS dataset. We report hIoU, base mIoU and novel mIoU metrics. The best open-vocabulary results are highlighted in bold.
  • Figure 3: Visualization Comparisons between XMask3D and Previous Methods. We compare XMask3D with PLA (Ding et al., 2023) and OpenScene (Peng et al., 2023) on the novel categories table, bookshelf, chair and bed. The regions corresponding to the novel categories are highlighted in red boxes.
  • Figure 4: Visualization Results of Ablations. The first and second groups show results from the ScanNet B15/N4 and B12/N7 benchmarks, respectively. In each group, the first and second rows display segmentation results without and with the proposed mask regularization. The last three columns compare the outputs from the intermediate 2D and 3D branches with the final fusion block.
  • ...and 3 more figures
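
The S3DIS caption above reports hIoU alongside base and novel mIoU. In open-vocabulary segmentation benchmarks, hIoU is commonly defined as the harmonic mean of the base and novel mIoU. The sketch below assumes this paper follows that convention; the function name `harmonic_iou` and the input values are illustrative and are not results from the paper.

```python
def harmonic_iou(base_miou, novel_miou):
    """Harmonic mean of base-class and novel-class mIoU.

    Assumption: hIoU follows the common open-vocabulary convention
    hIoU = 2 * base * novel / (base + novel).
    """
    total = base_miou + novel_miou
    if total == 0:
        return 0.0  # both scores are zero, so the harmonic mean is zero
    return 2.0 * base_miou * novel_miou / total

# Illustrative values only (percentages), not taken from the paper.
example_hiou = harmonic_iou(60.0, 40.0)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method cannot reach a high hIoU by performing well on base categories alone.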