DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding
Qinghongbing Xie, Zijian Liang, Fuhao Li, Long Zeng
TL;DR
The paper addresses the limited semantic richness of scene representations used for 3D visual grounding. It introduces the Diverse Semantic Map (DSM), a persistent 3D representation built from multi-view observations that fuses geometry with VLM-derived semantics covering appearance, physical properties, and affordances. DSM-Grounding then replaces direct VLM queries with structured reasoning over this map, using candidate retrieval, relational filtering, and multi-level verification to improve accuracy and interpretability. The reported results are state-of-the-art zero-shot performance on ScanRefer (59.06% overall accuracy at IoU@0.5) and strong semantic segmentation on Replica (67.93% F-mIoU), along with successful robotic navigation and grasping in real-world settings. The work demonstrates the practical value of a dense, attribute-rich world model for perception, grounding, and manipulation in complex environments, while acknowledging its dependence on perception quality and on the latency of large vision-language models.
Abstract
Effective scene representation is critical for 3D visual grounding, yet existing methods are often constrained. They either focus only on geometric and visual cues or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM), a novel scene representation framework that enriches robust geometric models with a spectrum of VLM-derived semantics, including appearance, physical properties, and affordances. The DSM is constructed online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantically rich map, markedly improving accuracy and interpretability. Extensive evaluations validate the approach. On the ScanRefer benchmark, DSM-Grounding achieves a state-of-the-art overall accuracy of 59.06% at IoU@0.5, surpassing prior methods by 10%. In semantic segmentation, our DSM attains 67.93% F-mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real-world scenarios.
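For readers unfamiliar with the headline metric, the following is a minimal sketch of how accuracy at IoU@0.5 is commonly computed for 3D grounding: a prediction counts as correct when the IoU between its 3D box and the ground-truth box reaches the threshold. The axis-aligned box format, the function names box_iou_3d and accuracy_at_iou, and the use of ">=" at the threshold are assumptions made for illustration; this is not the authors' evaluation code.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union


def accuracy_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (assumed convention: >=)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this definition, 59.06% at IoU@0.5 would mean that roughly 59 of every 100 referring expressions yield a predicted box overlapping the ground truth by at least half of their union.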
