Connecting NeRFs, Images, and Text
Francesco Ballerini, Pierluigi Zama Ramirez, Roberto Mirabella, Samuele Salti, Luigi Di Stefano
TL;DR
Framing NeRFs as a modality that can be linked to images and text, the paper proposes a lightweight framework to connect NeRF weight embeddings with multimodal CLIP embeddings. The approach learns bidirectional mappings between nf2vec NeRF embeddings and CLIP embeddings using two small MLPs, enabling zero-shot NeRF classification, retrieval from images or text, and NeRF generation without rendering views. Key contributions include the first framework for NeRF–image–text linking, an efficient two-MLP training scheme, and a practical adaptation path to real images via diffusion-based augmentation (ControlNet). The results demonstrate competitive zero-shot classification, effective image- and text-based NeRF retrieval, and the potential to generate new NeRFs from prompts, offering a scalable way to store and access 3D content with existing multimodal models.
Abstract
Neural Radiance Fields (NeRFs) have emerged as a standard framework for representing 3D scenes and objects, introducing a novel data type for information exchange and storage. Concurrently, significant progress has been made in multimodal representation learning for text and image data. This paper explores a novel research direction that aims to connect the NeRF modality with other modalities, similar to established methodologies for images and text. To this end, we propose a simple framework that exploits pre-trained models for NeRF representations alongside multimodal models for text and image processing. Our framework learns a bidirectional mapping between NeRF embeddings and those obtained from corresponding images and text. This mapping unlocks several novel and useful applications, including NeRF zero-shot classification and NeRF retrieval from images or text.
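As a rough illustration of the bidirectional mapping described above, the following minimal sketch uses two small MLPs, one per direction, and performs zero-shot classification by cosine similarity in CLIP space. It is not the authors' implementation. The embedding sizes, hidden width, one-hidden-layer design, and all function names are assumptions made for illustration. The weights are random and untrained, and random vectors stand in for real nf2vec and CLIP embeddings, so the predictions are meaningless beyond showing the data flow.

```python
import numpy as np

NERF_DIM = 1024  # assumed size of an nf2vec NeRF embedding
CLIP_DIM = 512   # assumed size of a CLIP embedding (varies with the CLIP variant)
HIDDEN = 256     # hypothetical hidden width

rng = np.random.default_rng(0)


def init_mlp(in_dim, hidden_dim, out_dim):
    """Create randomly initialised (untrained) weights for a one-hidden-layer MLP."""
    return {
        "w1": rng.normal(0.0, 0.02, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.02, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp_forward(params, x):
    """Apply the MLP to a batch of row vectors."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    return h @ params["w2"] + params["b2"]


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(nerf_emb, text_emb, nerf2clip):
    """Map NeRF embeddings into CLIP space and pick the most similar class prompt."""
    mapped = l2_normalize(mlp_forward(nerf2clip, nerf_emb))
    sims = mapped @ l2_normalize(text_emb).T  # shape: (num_nerfs, num_classes)
    return sims.argmax(axis=1)


# One MLP per direction: NeRF -> CLIP and CLIP -> NeRF.
nerf2clip = init_mlp(NERF_DIM, HIDDEN, CLIP_DIM)
clip2nerf = init_mlp(CLIP_DIM, HIDDEN, NERF_DIM)

nerf_batch = rng.normal(size=(4, NERF_DIM))      # stand-ins for nf2vec embeddings
class_prompts = rng.normal(size=(10, CLIP_DIM))  # stand-ins for CLIP text embeddings

# NeRF -> CLIP direction: classify each NeRF against the class prompts.
predictions = zero_shot_classify(nerf_batch, class_prompts, nerf2clip)

# CLIP -> NeRF direction: map prompts to NeRF embedding space, where they could
# be compared against stored NeRF embeddings for retrieval.
mapped_prompts = mlp_forward(clip2nerf, class_prompts)
```

In this sketch, retrieval from images would follow the same pattern as classification, with CLIP image embeddings in place of the text embeddings and a nearest-neighbour search over the mapped vectors.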
