HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections

Chen Dudai, Morris Alper, Hana Bezalel, Rana Hanocka, Itai Lang, Hadar Averbuch-Elor

TL;DR

HaLo-NeRF targets semantic exploration of unconstrained photo collections by linking open-vocabulary architectural concepts to 3D scene representations. It combines three innovations: (1) LLM-based distillation of noisy image metadata into domain-specific semantic labels, (2) semantic adaptation of vision-language models through multi-view supervision and fine-tuning, and (3) 3D localization within a Ha-NeRF framework via a dedicated semantic head supervised by refined 2D segmentations. The HolyScenes benchmark enables rigorous evaluation of open-vocabulary semantic localization across large-scale landmarks in the wild. The results show substantial gains over 2D segmentation baselines and prior 3D methods, enabling intuitive text-driven exploration of 3D reconstructions with controlled viewpoints and lighting.

Abstract

Internet image collections containing photos captured by crowds of photographers show promise for enabling digital exploration of large-scale tourist landmarks. However, prior works focus primarily on geometric reconstruction and visualization, neglecting the key role of language in providing a semantic interface for navigation and fine-grained understanding. In constrained 3D domains, recent methods have leveraged vision-and-language models as a strong prior of 2D visual semantics. While these models display an excellent understanding of broad visual semantics, they struggle with unconstrained photo collections depicting such tourist landmarks, as they lack expert knowledge of the architectural domain. In this work, we present a localization system that connects neural representations of scenes depicting large-scale landmarks with text describing a semantic region within the scene, by harnessing the power of SOTA vision-and-language models with adaptations for understanding landmark scene semantics. To bolster such models with fine-grained knowledge, we leverage large-scale Internet data containing images of similar landmarks along with weakly-related textual information. Our approach is built upon the premise that images physically grounded in space can provide a powerful supervision signal for localizing new concepts, whose semantics may be unlocked from Internet textual metadata with large language models. We use correspondences between views of scenes to bootstrap spatial understanding of these semantics, providing guidance for 3D-compatible segmentation that ultimately lifts to a volumetric scene representation. Our results show that HaLo-NeRF can accurately localize a variety of semantic concepts related to architectural landmarks, surpassing the results of other 3D models as well as strong 2D segmentation baselines. Our project page is at https://tau-vailab.github.io/HaLo-NeRF/.

Paper Structure (31 sections, 17 figures, 7 tables)

This paper contains 31 sections, 17 figures, 7 tables.

Figures (17)

  • Figure 1: System overview of our approach. (a) We extract semantic pseudo-labels from noisy Internet image metadata using a large language model (LLM). (b) We use these pseudo-labels and correspondences between scene views to learn image-level and pixel-level semantics. In particular, we fine-tune an image segmentation model (CLIPSegFT) using multi-view supervision, where zoomed-in views and their associated pseudo-labels (such as the image on the left associated with the term "tympanum") provide a supervision signal for zoomed-out views. (c) We then lift this semantic understanding to learn volumetric probabilities over new, unseen landmarks (such as St. Paul's Cathedral, depicted on the right), allowing for rendering views of the segmented scene with controlled viewpoints and illumination settings. See below for the definitions of the concepts shown.
  • Figure 2: LLM-based distillation of semantic concepts. The full image metadata (Input), including Filename, "caption" and WikiCategories (depicted similarly above) are used for extracting distilled semantic pseudo-labels (Output) with an LLM. Note that the associated images on top (depicted with corresponding colors) are not used as inputs for the computation of their pseudo-labels.
  • Figure 3: Adapting a text-based image segmentation model to architectural landmarks. We utilize image correspondences (such as the pairs depicted on the left) and pseudo-labels to fine-tune CLIPSeg. We propagate the pseudo-label of the zoomed-in image to serve as the supervision target, as shown in the central column; we supervise predictions on the zoomed-out image only over the corresponding region (other regions are grayed out for illustration purposes). This supervision (together with the use of random crops, further described in the text) refines the model's ability to recognize and localize architectural concepts, as seen by the improved performance shown on the right.
  • Figure 4: Text-based segmentation before and after fine-tuning. Above we show 2D segmentation results over images belonging to landmarks from HolyScenes (unseen during training). As illustrated above, our weakly-supervised fine-tuning scheme improves the segmentation of domain-specific semantic concepts.
  • Figure 5: Neural 3D Localization Results. We show results from each landmark in our HolyScenes benchmark (clockwise from top: St. Paul's Cathedral, Hurva Synagogue, Notre-Dame Cathedral, Blue Mosque, Badshahi Mosque, Milan Cathedral), visualizing segmentation maps rendered from 3D HaLo-NeRF representations on input scene images. As seen above, HaLo-NeRF succeeds in localizing various semantic concepts across diverse landmarks.
  • ...and 12 more figures
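
The Figure 3 caption says predictions on the zoomed-out image are supervised only over the region that corresponds to the zoomed-in view. The sketch below illustrates that idea with a masked per-pixel loss. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `masked_bce`, the choice of binary cross-entropy, and the plain 2D-list inputs are all hypothetical, and the paper's actual loss may differ.

```python
import math


def masked_bce(pred, target, mask, eps=1e-7):
    """Mean binary cross-entropy over the pixels where mask is 1.

    pred, target and mask are equally sized 2D lists (hypothetical format):
    pred and target hold probabilities in [0, 1], mask holds 0 or 1.
    Pixels with mask 0 do not contribute to the loss.
    """
    total, count = 0.0, 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if not m:
                continue  # pixel lies outside the corresponding region
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    # An empty mask yields zero loss instead of a division by zero.
    return total / count if count else 0.0


# Toy 2x2 example: only the top row falls inside the corresponding region.
pred = [[0.9, 0.2], [0.5, 0.5]]
target = [[1.0, 0.0], [1.0, 0.0]]
mask = [[1, 1], [0, 0]]
loss = masked_bce(pred, target, mask)
```

Because the bottom row is masked out, changing its predictions leaves `loss` unchanged, which is the behavior the caption describes for regions outside the correspondence.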