Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training

Andrea Amaduzzi, Pierluigi Zama Ramirez, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano

TL;DR

This work presents LLaNA, the first Multimodal LLM that directly ingests NeRF weights to perform NeRF captioning, Q&A, and zero-shot classification, without rendering images or extracting geometry. It encodes the NeRF MLP weights with an nf2vec-based meta-encoder and projects the resulting embedding into the input space of LLaMA-2, enabling coherent NeRF-language reasoning. The authors construct ObjaNeRF-Text and ShapeNeRF-Text as large-scale NeRF-language benchmarks (>320K NeRFs) and show that processing the weights directly outperforms baselines that rely on 2D or 3D representations on NeRF-language tasks, while increasing the LLM size brings only limited gains. The work makes a case for NeRFs as a standalone modality for language-driven understanding of 3D objects, highlights the importance of task-aligned training data, and outlines directions for extending the approach to more complex NeRF architectures and scenes.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in understanding both images and 3D data, yet these modalities face inherent limitations in comprehensively representing object geometry and appearance. Neural Radiance Fields (NeRFs) have emerged as a promising alternative, encoding both geometric and photorealistic properties within the weights of a simple Multi-Layer Perceptron (MLP). This work investigates the feasibility and effectiveness of ingesting NeRFs into an MLLM. We introduce LLaNA, the first MLLM able to perform new tasks such as NeRF captioning and Q&A by directly processing the weights of a NeRF's MLP. Notably, LLaNA extracts information about the represented objects without the need to render images or materialize 3D data structures. In addition, we build the first large-scale NeRF-language dataset, comprising more than 300K NeRFs trained on ShapeNet and Objaverse, with paired textual annotations that enable various NeRF-language tasks. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that directly processing NeRF weights leads to better performance on NeRF-language tasks than approaches that rely on either 2D or 3D representations derived from NeRFs.
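
To make the idea of "directly processing the weights" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy NeRF MLP whose layers all share the same width, a stand-in meta-encoder (one linear map with ReLU applied to each weight row, followed by max-pooling over rows), and a single linear projection into a hypothetical LLM embedding size. All names, sizes, and the random weights are illustrative assumptions; in the paper the encoder and projector are trained networks.

```python
import random


def random_matrix(rng, n_rows, n_cols):
    """Toy stand-in for trained weights: an n_rows x n_cols matrix of floats."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


def flatten_mlp(layers):
    """Stack the rows of every layer's weight matrix into one list of rows.

    Assumes every layer has the same number of columns (toy simplification).
    """
    rows = []
    for weight in layers:
        rows.extend(weight)
    return rows


def linear(vec, weight):
    """Multiply a vector of length d_in by a d_in x d_out weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def encode_rows(rows, enc_weight):
    """Hypothetical meta-encoder: per-row linear map + ReLU, then max-pool over rows."""
    hidden = [[max(0.0, h) for h in linear(row, enc_weight)] for row in rows]
    return [max(col) for col in zip(*hidden)]


rng = random.Random(0)
WIDTH, EMBED_DIM, LLM_DIM = 8, 16, 32  # toy sizes, not the paper's

# Three weight matrices standing in for the layers of a NeRF MLP.
nerf_layers = [random_matrix(rng, WIDTH, WIDTH) for _ in range(3)]
enc_weight = random_matrix(rng, WIDTH, EMBED_DIM)     # stand-in encoder weights
proj_weight = random_matrix(rng, EMBED_DIM, LLM_DIM)  # stand-in projector weights

rows = flatten_mlp(nerf_layers)             # the NeRF seen as a bag of weight rows
embedding = encode_rows(rows, enc_weight)   # one global embedding per NeRF
nerf_token = linear(embedding, proj_weight) # vector in the LLM's input space

print(len(rows), len(embedding), len(nerf_token))
```

In this sketch the NeRF is never rendered or sampled: the only input is the list of weight matrices, and the output is a single vector that could be placed alongside text token embeddings in the LLM input.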

Paper Structure

This paper contains 17 sections, 10 figures, 13 tables.

Figures (10)

  • Figure 1: LLaNA. A new Multimodal Large Language Model that understands and reasons on an input NeRF. Notably, our framework directly processes the NeRF weights and performs tasks such as captioning, Q&A, and zero-shot classification of NeRFs.
  • Figure 2: Framework overview. Example of NeRF captioning.
  • Figure 3: ObjaNeRF-Text statistics of ground-truth text annotations.
  • Figure 4: Automatic annotation pipeline. Given a 3D model, $N$ views are rendered and processed by a VLM (LLaVA) to generate view-specific captions. These are aggregated by an LLM (LLaMA) for final descriptions and Q&A.
  • Figure 5: Qualitative results on ShapeNeRF-Text brief descriptions.
  • ...and 5 more figures
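
The automatic annotation pipeline described in the Figure 4 caption (render $N$ views, caption each view with a VLM, aggregate the captions with an LLM) can be sketched as a three-stage function. This is only an illustration of the control flow: `render_view`, `caption_view`, and `aggregate` are hypothetical callables, and the stub implementations below return placeholder strings rather than calling a renderer, LLaVA, or LLaMA.

```python
from typing import Callable, List


def annotate_object(
    model_id: str,
    n_views: int,
    render_view: Callable[[str, int], str],
    caption_view: Callable[[str], str],
    aggregate: Callable[[List[str]], str],
) -> str:
    """Render n_views of a 3D model, caption each view, and merge the captions."""
    views = [render_view(model_id, index) for index in range(n_views)]
    captions = [caption_view(view) for view in views]
    return aggregate(captions)


# Stubs standing in for the renderer, the VLM, and the LLM (all hypothetical).
def fake_render(model_id: str, index: int) -> str:
    return f"{model_id}/view_{index}.png"


def fake_caption(image_path: str) -> str:
    return f"caption for {image_path}"


def fake_aggregate(captions: List[str]) -> str:
    return " | ".join(captions)


description = annotate_object("chair_001", 3, fake_render, fake_caption, fake_aggregate)
print(description)
```

Passing the three stages as callables keeps the sketch independent of any particular renderer or model API; a real pipeline would replace the stubs with actual rendering and model calls.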