CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D

Mohamad Amin Mirzaei, Pantea Amoie, Ali Ekhterachian, Matin Mirzababaei, Babak Khalaj

TL;DR

CORE-3D tackles open-vocabulary 3D perception by correcting fragmentation and context loss in 2D-to-3D mappings. It combines SemanticSAM-based progressive mask refinement with context-aware, multi-crop CLIP embeddings and a 3D merging step to produce coherent object-level semantic maps without 3D supervision. The approach demonstrates state-of-the-art results in 3D open-vocabulary semantic segmentation on Replica and ScanNet and significantly improves natural-language object retrieval on SR3D+. By integrating LLM/VLM verification and structured language grounding, CORE-3D offers robust zero-shot 3D perception and language-grounded reasoning suitable for embodied AI and robotics.

Abstract

3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning embedding vectors to 2D class-agnostic masks generated via vision-language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object-level masks, mitigating the over-segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context-aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.
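
The context-aware encoding described above combines several crop embeddings of one mask into a single vector by weighted averaging. The following is a minimal sketch of that idea, not the authors' implementation: the function name `aggregate_crop_embeddings`, the per-crop L2 normalization, and the final renormalization are assumptions made for illustration, and the weights stand in for the empirically determined values, which are not given here.

```python
import numpy as np


def aggregate_crop_embeddings(crop_embeddings, weights):
    """Fuse per-crop embeddings of one mask into a single unit vector.

    crop_embeddings: array-like of shape (num_crops, dim), one row per
        contextual crop (hypothetical input; any image encoder could supply it).
    weights: array-like of shape (num_crops,), non-negative, not all zero.
    """
    emb = np.asarray(crop_embeddings, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    if emb.ndim != 2 or w.shape != (emb.shape[0],):
        raise ValueError("expected (num_crops, dim) embeddings and (num_crops,) weights")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative and sum to a positive value")

    # Assumption: normalize each crop embedding so no crop dominates by magnitude.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # Weighted average across crops.
    fused = (w[:, None] * emb).sum(axis=0) / w.sum()

    # Assumption: renormalize so cosine similarity reduces to a dot product.
    return fused / np.linalg.norm(fused)
```

With equal weights the fused vector lies midway between the crop directions. Setting a weight to zero removes that crop's contribution entirely.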

Paper Structure

This paper contains 34 sections, 15 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of object retrieval from natural language in a 3D scene. A natural language query specifies a target and spatial relation (“Facing the cabinet, on which side of the cabinet is the door?”). Our framework retrieves object embeddings, grounds them in 3D coordinates, selects the appropriate view to face the cabinet using VLM, and reasons about spatial orientation to output the correct relation.
  • Figure 2: Overview of our training-free open-vocabulary 3D semantic segmentation and retrieval pipeline. Given RGB-D image sequences, we first generate progressive multi-granularity 2D masks ($M_t$) to mitigate fragmentation. Each mask is encoded with CLIP using multiple contextual crops (mask, bounding box, large, huge, surroundings), and their embeddings are aggregated via weighted averaging. In parallel, depth maps and poses are fused into a 3D point cloud, where embeddings are assigned per point-cloud mask. Multi-view predictions are merged and refined with DBSCAN clustering to enforce consistency, resulting in a coherent 3D semantic map with point-cloud embeddings that support both open-vocabulary segmentation and object retrieval.
  • Figure 3: Pipeline for natural-language object retrieval. A free-form query is parsed into structured form $\Pi(q) = (m, \mathcal{R}, \Omega)$. Candidate objects are mined using CLIP similarity and DBSCAN clustering, projected into frames, and verified by a VLM restricted to bounding boxes. If orientation constraints $\Omega$ are present, canonical views are rendered and resolved with a VLM. Finally, an LLM reasons over the verified candidates, referenced objects $\mathcal{R}$, and orientation cues to select the final prediction.
  • Figure 4: Qualitative comparison of 3D open-vocabulary semantic segmentation on Replica scenes. The GT, BBQ-CLIP, ConceptGraphs, OpenFusion, and ConceptFusion columns are adapted from Linok et al. (2025), reproduced here under fair use for research comparison. Our method yields more accurate segmentation boundaries and finer recognition of challenging categories; notably, it is the only method that segments and labels the rug correctly, a class frequently missed or confused by competing approaches.
  • Figure 5: Hyperparameter sensitivity on Replica (scene room0) for the four weighting parameters $\alpha_h, \alpha_l, \alpha_o, \alpha_m$. Performance is reported in mIoU, mAcc, and fmIoU as the scaling factor is varied.
  • ...and 3 more figures
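
The Figure 3 caption mentions mining candidate objects by CLIP similarity before clustering and verification. A minimal sketch of the similarity-ranking step alone could look like the following. It assumes the query and the object embeddings already live in a shared embedding space; the function name `rank_candidates` and the `top_k` parameter are hypothetical, and the DBSCAN clustering, VLM verification, and LLM reasoning stages are deliberately left out.

```python
import numpy as np


def rank_candidates(query_embedding, object_embeddings, top_k=3):
    """Return (object_index, cosine_similarity) pairs, best match first.

    query_embedding: array-like of shape (dim,), e.g. a text embedding.
    object_embeddings: array-like of shape (num_objects, dim).
    top_k: hypothetical cutoff on how many candidates to keep.
    """
    q = np.asarray(query_embedding, dtype=np.float64)
    objs = np.asarray(object_embeddings, dtype=np.float64)

    # Normalize both sides so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    objs = objs / np.linalg.norm(objs, axis=1, keepdims=True)

    scores = objs @ q

    # Stable sort keeps the original order among exactly tied scores.
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

In a full pipeline, the returned candidates would still need the later verification stages the caption describes; similarity ranking on its own only narrows the search.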