Query3D: LLM-Powered Open-Vocabulary Scene Segmentation with Language Embedded 3D Gaussian

Amirhosein Chahe, Lifeng Zhou

TL;DR

This work tackles open-vocabulary 3D scene querying for autonomous driving by integrating Language Embedded 3D Gaussians with large language models. The authors design a pipeline in which LLMs generate query-related components, including canonical phrases and helping positives, which are combined with the language features embedded in the 3D Gaussians to produce relevancy scores for segmentation. They validate the approach on WayveScenes101, showing significant gains over fixed canonical-phrase baselines, and demonstrate that smaller fine-tuned LLMs can match larger models in performance while enabling on-device inference. The findings highlight the practical potential of on-device, LLM-guided semantic reasoning to enhance context-aware perception in autonomous systems; ablations indicate that the usefulness of helping positives grows with model scale. This work bridges 3D scene representation with high-level semantic querying to support efficient and adaptable autonomous navigation and planning.

Abstract

This paper introduces a novel method for open-vocabulary 3D scene querying in autonomous driving by combining Language Embedded 3D Gaussians with Large Language Models (LLMs). We propose utilizing LLMs to generate both contextually canonical phrases and helping positive words for enhanced segmentation and scene interpretation. Our method leverages GPT-3.5 Turbo as an expert model to create a high-quality text dataset, which we then use to fine-tune smaller, more efficient LLMs for on-device deployment. Our comprehensive evaluation on the WayveScenes101 dataset demonstrates that LLM-guided segmentation significantly outperforms traditional approaches based on predefined canonical phrases. Notably, our fine-tuned smaller models achieve performance comparable to larger expert models while maintaining faster inference times. Through ablation studies, we discover that the effectiveness of helping positive words correlates with model scale, with larger models better equipped to leverage additional semantic information. This work represents a significant advancement towards more efficient, context-aware autonomous driving systems, effectively bridging 3D scene representation with high-level semantic querying while maintaining practical deployment considerations.

Paper Structure (24 sections, 6 equations, 5 figures, 1 table, 1 algorithm)

This paper contains 24 sections, 6 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our LLM-enhanced Language Embedded 3D Gaussian Splatting pipeline. The LE3DGS model generates language feature maps, which are then processed using Algorithm 1 to compute relevancy scores with LLM-generated queries ($p_{\texttt{quer}}$), helping positives ($p_{\texttt{help}}$), and canonical phrases ($p_{\texttt{canon}}$). The system responds to the high-level queries of a driving scenario by highlighting relevant objects (cars, traffic lights, pedestrians) across multiple views.
  • Figure 2: Visual comparison of segmentation results using different Qwen models responding to a contextual query "Driving through an intersection in an urban area on a sunny day, what objects should the driver pay attention to?" From top to bottom: Results from Qwen-0.5B, 1.5B, 3B, and 7B models. Each row shows three camera views (left, front, right) of the same scene. The results demonstrate how larger models produce more precise and contextually relevant segmentations for autonomous driving scenarios.
  • Figure 3: Example conversation with GPT-3.5 Turbo for the query "Pedestrian" on scene 12 of WayveScenes101.
  • Figure 4: Qualitative comparison of semantic segmentation results before and after fine-tuning for different model variants. Each subfigure shows three views of the same scene, with the top row representing results from the instruction-tuned model (before fine-tuning) and the bottom row showing results from the fine-tuned model.
  • Figure 5: Visual comparison of query relevance across different scenes with different approaches. From top to bottom, the query words are “trees”, “traffic signs”, “pedestrian”, “sidewalk”, and “cars”.
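
The Figure 1 caption describes scoring each language feature against a query ($p_{\texttt{quer}}$), helping positives ($p_{\texttt{help}}$), and canonical phrases ($p_{\texttt{canon}}$). The paper's Algorithm 1 is not reproduced on this page, so the following is only a minimal sketch under stated assumptions. It uses the pairwise-softmax relevancy common in language-embedded field work (the minimum, over canonical phrases, of the probability that the feature matches the positive rather than the canonical phrase). The rule for combining the query with helping positives (taking the best score) is a guess, and all function names are hypothetical. Embeddings are plain Python lists standing in for text and scene features.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def pairwise_relevancy(feature, positive, canonicals):
    """Min over canonical phrases of softmax(positive vs. canonical)."""
    f = normalize(feature)
    pos_sim = dot(f, normalize(positive))
    scores = []
    for c in canonicals:
        neg_sim = dot(f, normalize(c))
        # Equivalent to exp(pos) / (exp(pos) + exp(neg)).
        scores.append(1.0 / (1.0 + math.exp(neg_sim - pos_sim)))
    return min(scores)

def relevancy(feature, p_quer, p_help, p_canon):
    # ASSUMPTION: the query and helping positives are combined by keeping
    # the best-scoring positive; the paper's actual rule may differ.
    positives = [p_quer] + list(p_help)
    return max(pairwise_relevancy(feature, p, p_canon) for p in positives)
```

A feature would then be marked relevant when its score exceeds some threshold (0.5 is a natural choice, since that is the score of a feature equally similar to the positive and to the closest canonical phrase). Under the max-combination assumption, adding helping positives can only raise a feature's score, never lower it.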