Query3D: LLM-Powered Open-Vocabulary Scene Segmentation with Language Embedded 3D Gaussian
Amirhosein Chahe, Lifeng Zhou
TL;DR
This work tackles open-vocabulary 3D scene querying for autonomous driving by integrating Language Embedded 3D Gaussians with large language models (LLMs). The authors design a pipeline in which an LLM generates query-related components, including canonical phrases and helping positives, which are fused with the language features embedded in the 3D Gaussians to produce relevancy scores for segmentation. They validate the approach on WayveScenes101, showing significant gains over baselines that use fixed canonical phrases, and demonstrate that smaller fine-tuned LLMs can match larger models in performance while enabling on-device inference. Ablations further indicate that the usefulness of helping positives grows with model scale. The findings highlight the practical potential of on-device, LLM-guided semantic reasoning for context-aware perception in autonomous systems, bridging 3D scene representation with high-level semantic querying to support efficient and adaptable autonomous navigation and planning.
Abstract
This paper introduces a novel method for open-vocabulary 3D scene querying in autonomous driving by combining Language Embedded 3D Gaussians with Large Language Models (LLMs). We propose using LLMs to generate both contextual canonical phrases and helping positive words for enhanced segmentation and scene interpretation. Our method uses GPT-3.5 Turbo as an expert model to create a high-quality text dataset, which we then use to fine-tune smaller, more efficient LLMs for on-device deployment. Our evaluation on the WayveScenes101 dataset demonstrates that LLM-guided segmentation significantly outperforms traditional approaches based on predefined canonical phrases. Notably, our fine-tuned smaller models achieve performance comparable to larger expert models while offering faster inference. Through ablation studies, we find that the effectiveness of helping positive words correlates with model scale, with larger models better equipped to leverage the additional semantic information. This work is a step towards more efficient, context-aware autonomous driving systems, bridging 3D scene representation with high-level semantic querying while remaining practical to deploy.
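To make the role of canonical phrases and helping positives concrete, the following is a minimal sketch of how a per-Gaussian relevancy score might be computed. It is not the paper's implementation. It assumes a LERF-style pairwise softmax between the query and each canonical phrase, and it assumes that helping positives are fused by taking the best-matching positive embedding; the text above states neither detail. The function names, the temperature, the 0.5 threshold, and the toy 3-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevancy(feature, positives, canonicals, temperature=0.1):
    """Relevancy of one language feature to a query (illustrative only).

    positives: embeddings of the query plus any helping positives.
    canonicals: embeddings of the canonical (negative) phrases.
    Assumption: positives are fused by taking the maximum similarity.
    """
    pos_sim = max(cosine(feature, p) for p in positives)
    scores = []
    for c in canonicals:
        can_sim = cosine(feature, c)
        # Two-way softmax of positive vs. this canonical phrase,
        # written as a sigmoid of the scaled similarity gap.
        scores.append(1.0 / (1.0 + math.exp((can_sim - pos_sim) / temperature)))
    # Keep the most pessimistic comparison across canonical phrases.
    return min(scores)


def segment(features, positives, canonicals, threshold=0.5):
    """Boolean mask over Gaussians whose relevancy exceeds the threshold."""
    return [relevancy(f, positives, canonicals) > threshold for f in features]


# Toy example with made-up 3-D embeddings standing in for real text features.
query = [0.9, 0.1, 0.0]
helping_positives = [[1.0, 0.0, 0.0]]
canonical_phrases = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gaussian_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

mask = segment(gaussian_features, [query] + helping_positives, canonical_phrases)
print(mask)
```

In this sketch the first Gaussian aligns with the positives and is selected, while the second aligns with a canonical phrase and is rejected. In a real system the embeddings would come from a text encoder and from the language features stored on the Gaussians, and the canonical phrases and helping positives would be produced by the LLM for each query.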
