CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation

Qi Ma, Runyi Yang, Bin Ren, Nicu Sebe, Ender Konukoglu, Luc Van Gool, Danda Pani Paudel

TL;DR

CityLoc tackles the challenge of localizing textual descriptions within expansive 3D city scenes by generating a distribution over camera poses conditioned on text. It combines a diffusion-based model that learns $p(P|\mathcal{T})$ using a Transformer denoiser with a Gaussian splatting renderer that refines pose samples via visual reasoning, guided by CLIP-based cross-modal features. The method incorporates Mixup-style multi-modal conditioning and a Gaussian refinement step that renders candidate poses and optimizes their alignment with the textual description, achieving superior Relative Distribution Accuracy across five large-scale datasets. This enables robust, language-driven localization and multi-modal scene understanding at city scale, with practical implications for autonomous navigation and human-robot interaction; future work includes leveraging stronger vision-language models for richer text prompts.
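
The Mixup-style multi-modal conditioning mentioned above can be illustrated with a minimal sketch. It assumes the conditioning signal is a convex combination of a text embedding and an image embedding living in the same CLIP feature space; the function names, the Beta-distributed mixing weight, and the 512-dimensional random features are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each feature vector to unit length, as is common for CLIP features."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def mix_condition(text_feat, image_feat, rng, alpha=1.0):
    """Blend text and image features with one Mixup weight lam ~ Beta(alpha, alpha).

    Assumption: both inputs share the same embedding space, so a convex
    combination of the two is still a meaningful conditioning vector.
    """
    lam = rng.beta(alpha, alpha)
    mixed = lam * l2_normalize(text_feat) + (1.0 - lam) * l2_normalize(image_feat)
    return l2_normalize(mixed), lam

# Random vectors stand in for real CLIP text and image embeddings.
rng = np.random.default_rng(0)
text_feat = rng.normal(size=(4, 512))
image_feat = rng.normal(size=(4, 512))
cond, lam = mix_condition(text_feat, image_feat, rng)
```

Under this assumption, a denoiser trained on such blended features could accept text-only or image-only conditioning at inference time, since each is an endpoint of the interpolation.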

Abstract

Localizing textual descriptions within large-scale 3D scenes presents inherent ambiguities, such as identifying all traffic lights in a city. Addressing this, we introduce a method to generate distributions of camera poses conditioned on textual descriptions, facilitating robust reasoning for broadly defined concepts. Our approach employs a diffusion-based architecture to refine noisy 6DoF camera poses towards plausible locations, with conditional signals derived from pre-trained text encoders. Integration with the pretrained Vision-Language Model, CLIP, establishes a strong linkage between text descriptions and pose distributions. Enhancement of localization accuracy is achieved by rendering candidate poses using 3D Gaussian splatting, which corrects misaligned samples through visual reasoning. We validate our method's superiority by comparing it against standard distribution estimation methods across five large-scale datasets, demonstrating consistent outperformance. Code, datasets and more information will be publicly available at our project page.
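
The idea of refining noisy 6DoF poses towards plausible locations can be sketched with a standard DDPM-style ancestral sampling loop. This is only an illustration under stated assumptions: the 6-dimensional pose vector, the linear noise schedule, and `toy_denoiser` (an oracle that reads its target pose directly from the conditioning vector) are hypothetical and stand in for the paper's Transformer denoiser and pose parameterization, which are not specified here.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative products used by DDPM."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, cond, alpha_bars):
    """Hypothetical stand-in for a learned denoiser.

    It pretends the clean pose is stored in the first entries of `cond` and
    returns the noise that would explain x_t under that clean pose.
    """
    target = cond[..., : x_t.shape[-1]]
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def sample_poses(cond, num_samples, pose_dim, rng, num_steps=50):
    """Start from Gaussian noise and denoise step by step, conditioned on `cond`."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.normal(size=(num_samples, pose_dim))
    for t in reversed(range(num_steps)):
        eps_hat = toy_denoiser(x, t, cond, alpha_bars)
        # Posterior mean of the DDPM reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.normal(size=x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

rng = np.random.default_rng(0)
cond = rng.normal(size=512)  # stand-in for a CLIP text feature
poses = sample_poses(cond, num_samples=8, pose_dim=6, rng=rng)
```

Because the toy denoiser is an oracle, all samples collapse onto a single pose. A learned denoiser conditioned on an ambiguous description would instead be expected to spread samples over several plausible locations, which is the distributional behavior the abstract describes.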

Paper Structure (21 sections, 11 equations, 14 figures, 3 tables, 2 algorithms)

Figures (14)

  • Figure 1: CityLoc: Given an ambiguous text description, our method accurately estimates the camera pose distribution across a large-scale urban environment, pinpointing probable locations like parking spots. Using Vision-Language Models (VLMs), our approach also incorporates image inputs for more precise, context-aware pose localization.
  • Figure 2: We present qualitative results of our large-scale Gaussian splats, including the number of images and the trained Gaussian memory size for each scene.
  • Figure 3: Overview of CityLoc. During training, images and text inputs at multiple levels of granularity are first converted to CLIP features. A mixing algorithm combines these features to train a pose diffusion model, which maps them to a 6DoF camera pose distribution. At inference, the pose diffusion model outputs camera poses for any given text input. A pretrained Gaussian representation is then used to refine the poses by aligning the input text features with the rendered image features.
  • Figure 4: Qualitative results on the small town dataset: The enlarged green camera and its corresponding images are those used to generate multiple text prompts with varying levels of granularity. We report the pose distribution conditioned on different levels of text detail. The results clearly demonstrate that more informative text inputs lead to more precise location estimates. Additionally, cameras estimated in other locations provide meaningful insights. This is illustrated by selecting a pose within a high-density area for rendering, as shown by the red and orange cameras, where both estimates reveal the presence of a traffic light. Zoom in for better visual results.
  • Figure 5: An example question from the user study and its corresponding qualitative results. Ground-truth (GT) images appear on the left, while the rendered images are shown on the right. For quantitative results of the user study, please refer to \ref{tab:exp:user_study}.
  • ...and 9 more figures
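
The refinement stage described in the Figure 3 caption (render a candidate pose, then align the rendered image features with the text features) can be sketched as a small optimization loop. Everything here is a hypothetical stand-in: `toy_render_feature` replaces "render with Gaussian splatting and encode with CLIP", and finite-difference gradients replace the automatic differentiation a differentiable renderer would normally provide.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def toy_render_feature(pose, proj):
    """Hypothetical stand-in for rendering a pose and encoding the image."""
    return np.tanh(proj @ pose)

def refine_pose(pose, text_feat, render_feature, steps=100, lr=0.05, h=1e-4):
    """Gradient ascent on text-image similarity, keeping the best pose seen."""
    pose = pose.astype(float).copy()
    best_pose = pose.copy()
    best_score = cosine(render_feature(pose), text_feat)
    for _ in range(steps):
        grad = np.zeros_like(pose)
        for i in range(pose.size):
            # Central finite difference along pose coordinate i.
            step = np.zeros_like(pose)
            step[i] = h
            up = cosine(render_feature(pose + step), text_feat)
            down = cosine(render_feature(pose - step), text_feat)
            grad[i] = (up - down) / (2.0 * h)
        pose = pose + lr * grad
        score = cosine(render_feature(pose), text_feat)
        if score > best_score:
            best_pose, best_score = pose.copy(), score
    return best_pose, best_score

rng = np.random.default_rng(0)
proj = rng.normal(size=(32, 6))                     # toy "scene"
true_pose = rng.normal(size=6)
text_feat = toy_render_feature(true_pose, proj)     # feature the text should match
init_pose = true_pose + 0.3 * rng.normal(size=6)    # misaligned diffusion sample
init_score = cosine(toy_render_feature(init_pose, proj), text_feat)
refined_pose, refined_score = refine_pose(
    init_pose, text_feat, lambda p: toy_render_feature(p, proj)
)
```

Tracking the best-scoring pose means the refined sample is never worse than the initial one under the similarity measure, which is one simple way to keep refinement from degrading an already well-placed sample.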