Weight Space Representation Learning on Diverse NeRF Architectures

Francesco Ballerini, Pierluigi Zama Ramirez, Luigi Di Stefano, Samuele Salti

TL;DR

The paper tackles the challenge of running downstream tasks on NeRFs that come in many different architectures, using a Graph Meta-Network to embed NeRF parameter graphs into a common latent space. It couples a rendering-based objective with a SigLIP contrastive loss to produce embeddings that reflect object content rather than the underlying architecture, enabling robust classification, retrieval, and language tasks across MLP, tri-plane, and hash-table NeRFs, including architectures unseen during training. The approach matches or exceeds single-architecture baselines and generalizes to a new dataset (Objaverse) and to multi-modal language tasks, suggesting a path toward a foundational NeRF weight-space model. The main limitation is that evaluation relies on a single primary dataset (ShapeNetRender); expansion to larger-scale NeRF collections is planned. Overall, the work offers a scalable, architecture-agnostic paradigm for NeRF weight space processing with broad downstream applicability.

Abstract

Neural Radiance Fields (NeRFs) have emerged as a groundbreaking paradigm for representing 3D objects and scenes by encoding shape and appearance information into the weights of a neural network. Recent studies have demonstrated that these weights can be used as input for frameworks designed to address deep learning tasks; however, such frameworks require NeRFs to adhere to a specific, predefined architecture. In this paper, we introduce the first framework capable of processing NeRFs with diverse architectures and performing inference on architectures unseen at training time. We achieve this by training a Graph Meta-Network within an unsupervised representation learning framework, and show that a contrastive objective is conducive to obtaining an architecture-agnostic latent space. In experiments conducted across 13 NeRF architectures belonging to three families (MLPs, tri-planes, and, for the first time, hash tables), our approach demonstrates robust performance in classification, retrieval, and language tasks involving multiple architectures, even unseen at training time, while also matching or exceeding the results of existing frameworks limited to single architectures.
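
To make the contrastive objective concrete, here is a minimal NumPy sketch of a SigLIP-style pairwise sigmoid loss applied to two batches of NeRF embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `siglip_loss`, the fixed `temperature` and `bias` values (learnable in SigLIP), and the convention that row i of each batch encodes the same object under a different architecture are all assumptions. How this term is weighted against the rendering loss is not specified in this summary.

```python
import numpy as np

def siglip_loss(emb_a, emb_b, temperature=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid loss (illustrative sketch).

    emb_a, emb_b: arrays of shape (batch, dim). Row i of emb_a and row i of
    emb_b are assumed to embed the same object, encoded by NeRFs with two
    different architectures. temperature and bias are fixed here for
    simplicity; treating them as constants is an assumption of this sketch.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)

    # Similarity logits for every (i, j) pair in the batch.
    logits = temperature * (a @ b.T) + bias

    # +1 on the diagonal (same object), -1 elsewhere (different objects).
    labels = 2.0 * np.eye(len(a)) - 1.0

    # log(sigmoid(x)) = -log(1 + exp(-x)), computed stably with logaddexp.
    log_sigmoid = -np.logaddexp(0.0, -labels * logits)

    # Sum over all pairs, averaged over the batch size.
    return -log_sigmoid.sum() / len(a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    objects = rng.normal(size=(4, 16))
    aligned = siglip_loss(objects, objects)
    shuffled = siglip_loss(objects, np.roll(objects, 1, axis=0))
    print(f"aligned pairs: {aligned:.3f}, shuffled pairs: {shuffled:.3f}")
```

Because every pair is scored independently with a sigmoid, the loss pulls together embeddings of the same object and pushes apart embeddings of different objects, regardless of which architecture produced them. That property is what the abstract describes as conducive to an architecture-agnostic latent space.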

Paper Structure

This paper contains 16 sections, 5 equations, 9 figures, 24 tables.

Figures (9)

  • Figure 1: Framework overview. Our representation learning framework leverages a Graph Meta-Network [lim2024graph] encoder to map weights of NeRFs with diverse architectures to a latent space where NeRFs representing similar objects are close to each other, regardless of their architecture. The embeddings are then used as input to downstream pipelines for classification, retrieval, and language tasks.
  • Figure 2: Method overview. Left: parameter graph construction for an MLP (left), a tri-plane (middle), and a multi-resolution hash table (right). For better clarity, the graphs of a single $2\times2\times2$ plane and of two $4\times2$ hash tables are shown. Right: our framework leverages a Graph Meta-Network [lim2024graph] encoder alongside the nf2vec decoder [ramirez2024deep] and is trained end-to-end on a dataset of NeRFs with different architectures ($\mathcal{N}_j^A$, $\mathcal{N}_j^B$) with both a rendering ($\mathcal{L}_\text{R}$) and a contrastive ($\mathcal{L}_\text{C}$) loss.
  • Figure 3: t-SNE plots. 2D projections of the latent space created by our framework when trained on a dataset of NeRFs of ShapeNetRender objects [xu2019disn], where each object is represented by three NeRFs parameterized by different architectures: MLPs, tri-planes, and multi-resolution hash tables.
  • Figure 4: Across-datasets NeRF retrieval ($\mathcal{L}_{\text{R}+\text{C}}$). Query from the test set of $\texttt{MLP}^\texttt{OB}$ [amaduzzi2025scaling], gallery from the test set of MLP, TRI, or HASH.
  • Figure 5: Parameter graph conversion. Top left: parameter graph representation of an MLP, proposed by [lim2024graph]. Right: parameter graph representation of a tri-plane, proposed by [lim2024graph]. Dotted edges should be connected to the $C$ channel nodes, but are not fully drawn for better visual clarity. Bottom left: our parameter graph representation of a multi-resolution hash table.
  • ...and 4 more figures
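
The parameter graph of an MLP mentioned in the captions of Figures 2 and 5 can be illustrated with a small sketch. It assumes one common convention: each neuron is a node, each weight is a directed edge carrying the weight value, and each layer's biases are edges from one extra bias node to the neurons of the next layer. The function name `mlp_parameter_graph` and the edge-tuple layout are hypothetical, and the exact node and edge features used in the paper may differ. Tri-planes and hash tables are not covered here.

```python
import numpy as np

def mlp_parameter_graph(weights, biases):
    """Build a simple parameter graph for an MLP (illustrative sketch).

    weights[k] has shape (out_k, in_k); biases[k] has shape (out_k,).
    Returns (num_nodes, edges), where edges is a list of
    (source_node, target_node, parameter_value) tuples.
    """
    # Layer widths: input size followed by each layer's output size.
    sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    # offsets[k] is the index of the first neuron node in layer k.
    offsets = np.concatenate([[0], np.cumsum(sizes)])
    num_nodes = int(offsets[-1])

    edges = []
    for k, (w, b) in enumerate(zip(weights, biases)):
        src0, dst0 = int(offsets[k]), int(offsets[k + 1])
        # One edge per weight, from input neuron i to output neuron o.
        for o in range(w.shape[0]):
            for i in range(w.shape[1]):
                edges.append((src0 + i, dst0 + o, float(w[o, i])))
        # Assumed convention: one extra bias node per layer, with one edge
        # per bias value pointing at the corresponding output neuron.
        bias_node = num_nodes
        num_nodes += 1
        for o in range(w.shape[0]):
            edges.append((bias_node, dst0 + o, float(b[o])))
    return num_nodes, edges


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
    bs = [rng.normal(size=3), rng.normal(size=1)]
    n, e = mlp_parameter_graph(ws, bs)
    print(f"{n} nodes, {len(e)} edges")
```

Since the graph has exactly one edge per parameter, MLPs of different widths and depths simply yield graphs of different sizes. This is what lets a single graph network encoder consume them without fixing the architecture in advance.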