Connecting NeRFs, Images, and Text

Francesco Ballerini, Pierluigi Zama Ramirez, Roberto Mirabella, Samuele Salti, Luigi Di Stefano

TL;DR

Framing NeRFs as a modality that can be linked to images and text, the paper proposes a lightweight framework to connect NeRF weight embeddings with multimodal CLIP embeddings. The approach learns bidirectional mappings between nf2vec NeRF embeddings and CLIP embeddings using two small MLPs, enabling zero-shot NeRF classification, retrieval from images or text, and NeRF generation without rendering views. Key contributions include the first framework for NeRF–image–text linking, an efficient two-MLP training scheme, and a practical adaptation path to real images via diffusion-based augmentation (ControlNet). The results demonstrate competitive zero-shot classification, effective image- and text-based NeRF retrieval, and the potential to generate new NeRFs from prompts, offering a scalable way to store and access 3D content with existing multimodal models.
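
The two-MLP mapping described above can be illustrated with a minimal NumPy sketch. It shows only the forward pass of two small fully connected networks, named `nerf2clip` and `clip2nerf` after the paper's figure caption. The embedding sizes, hidden width, and initialization are assumptions chosen for illustration, not values taken from the paper, and the training objective is left out because this summary does not state it.

```python
import numpy as np

# All sizes below are hypothetical placeholders, not values from the paper.
NERF_DIM = 1024  # assumed size of an nf2vec NeRF embedding
CLIP_DIM = 512   # assumed size of a CLIP image/text embedding
HIDDEN = 256     # assumed hidden width of each mapping MLP


def init_mlp(rng, sizes):
    """Return a list of (weight, bias) pairs for a fully connected network."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(params, x):
    """Apply the MLP: ReLU between layers, no activation on the output."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


rng = np.random.default_rng(0)
nerf2clip = init_mlp(rng, [NERF_DIM, HIDDEN, CLIP_DIM])  # NeRF space -> CLIP space
clip2nerf = init_mlp(rng, [CLIP_DIM, HIDDEN, NERF_DIM])  # CLIP space -> NeRF space

# Random stand-ins for a batch of 4 NeRF embeddings and 4 CLIP embeddings.
nerf_emb = rng.normal(size=(4, NERF_DIM))
clip_emb = rng.normal(size=(4, CLIP_DIM))

pred_clip = mlp_forward(nerf2clip, nerf_emb)  # shape (4, CLIP_DIM)
pred_nerf = mlp_forward(clip2nerf, clip_emb)  # shape (4, NERF_DIM)
```

In this sketch both encoders (nf2vec and CLIP) are treated as frozen feature extractors, so only the two small networks would need to be trained.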

Abstract

Neural Radiance Fields (NeRFs) have emerged as a standard framework for representing 3D scenes and objects, introducing a novel data type for information exchange and storage. Concurrently, significant progress has been made in multimodal representation learning for text and image data. This paper explores a novel research direction that aims to connect the NeRF modality with other modalities, similar to established methodologies for images and text. To this end, we propose a simple framework that exploits pre-trained models for NeRF representations alongside multimodal models for text and image processing. Our framework learns a bidirectional mapping between NeRF embeddings and those obtained from corresponding images and text. This mapping unlocks several novel and useful applications, including NeRF zero-shot classification and NeRF retrieval from images or text.
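
As a rough illustration of how a shared embedding space supports the applications named above, the sketch below performs zero-shot classification and retrieval on toy vectors. It assumes that NeRF embeddings have already been mapped into the CLIP space and that similarity is measured with cosine similarity. Both the similarity measure and the helper names (`zero_shot_classify`, `retrieve`) are assumptions for illustration, not details confirmed by this summary.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_scores(queries, gallery):
    """Cosine similarity between every query row and every gallery row."""
    return l2_normalize(queries) @ l2_normalize(gallery).T


def zero_shot_classify(mapped_nerf_emb, class_text_emb):
    """Assign each mapped NeRF embedding to the most similar class prompt."""
    return cosine_scores(mapped_nerf_emb, class_text_emb).argmax(axis=1)


def retrieve(query_emb, gallery_emb, k=1):
    """Return indices of the k gallery items most similar to each query."""
    scores = cosine_scores(query_emb, gallery_emb)
    return np.argsort(-scores, axis=1)[:, :k]


# Toy 3-D stand-ins for text embeddings of three class prompts.
class_text_emb = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])

# Toy stand-ins for two NeRF embeddings already mapped into the same space.
mapped = np.array([[0.9, 0.1, 0.0],
                   [0.1, 0.2, 0.8]])

labels = zero_shot_classify(mapped, class_text_emb)

# Retrieval: rank the same toy gallery against a single query embedding.
top = retrieve(np.array([[0.0, 1.0, 0.1]]), class_text_emb, k=2)
```

The same ranking routine covers both directions: a query can be an image or text embedding searched against mapped NeRF embeddings, or the reverse.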

Paper Structure (28 sections, 2 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Framework applications. Examples of the possible tasks we can perform thanks to our framework that connects NeRFs, images, and text.
  • Figure 1: Zero-shot NeRF classification results.
  • Figure 2: Feature mapping network training. clip2nerf is a feature mapping network trained to map image embeddings of NeRF views to NeRF embeddings. Conversely, nerf2clip computes the mapping in the opposite direction.
  • Figure 3: Zero-shot NeRF classification method overview.
  • Figure 5: NeRF retrieval method overview. NeRF retrieval from images (top) and text (bottom).
  • ...and 8 more figures