HyperFields: Towards Zero-Shot Generation of NeRFs from Text

Sudarshan Babu, Richard Liu, Avery Zhou, Michael Maire, Greg Shakhnarovich, Rana Hanocka

TL;DR

HyperFields addresses the cost of per-scene optimization in text-to-3D NeRF generation by learning a single, shared dynamic hypernetwork that maps text embeddings to NeRF weights, which enables zero-shot synthesis of unseen scenes. It introduces NeRF distillation to supervise training across many scenes: the hypernetwork generates NeRF weights progressively, conditioning each layer on the activations of the previous one, and is trained with a distillation loss against per-scene teacher NeRFs, which avoids the pitfalls of training it directly with Score Distillation Sampling (SDS). The approach generalizes zero-shot to novel in-distribution prompt combinations and converges faster when fine-tuned on out-of-distribution prompts, synthesizing novel scenes 5 to 10 times faster than existing optimization-based methods, and it is compatible with high-detail teacher models such as ProlificDreamer. The work demonstrates a scalable path toward fast, open-vocabulary text-to-NeRF generation, with the potential to extend to other implicit 3D representations. Limitations include dependence on teacher model quality and biases inherited from diffusion models.

Abstract

We introduce HyperFields, a method for generating text-conditioned Neural Radiance Fields (NeRFs) with a single forward pass and (optionally) some fine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns a smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF distillation training, which distills scenes encoded in individual NeRFs into one dynamic hypernetwork. These techniques enable a single network to fit over a hundred unique scenes. We further demonstrate that HyperFields learns a more general map between text and NeRFs, and consequently is capable of predicting novel in-distribution and out-of-distribution scenes -- either zero-shot or with a few finetuning steps. Finetuning HyperFields benefits from accelerated convergence thanks to the learned general map, and is capable of synthesizing novel scenes 5 to 10 times faster than existing neural optimization-based methods. Our ablation experiments show that both the dynamic architecture and NeRF distillation are critical to the expressivity of HyperFields.
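The NeRF distillation objective mentioned in the abstract can be illustrated with a small sketch. The Figure 2 caption describes it as a photometric loss between renders of the hypernetwork and of the teacher network; the sketch below assumes that loss is a mean squared error over pixel values, which the text here does not specify. The function names (`photometric_loss`, `distillation_loss`) and the toy renders are hypothetical stand-ins, not the authors' code.

```python
import random

def photometric_loss(student_render, teacher_render):
    """Mean squared error between two renders, each a flat list of pixel values.

    Assumption: MSE is used as the photometric loss; the source only says
    "photometric loss" without naming the exact form.
    """
    if len(student_render) != len(teacher_render):
        raise ValueError("renders must have the same number of pixel values")
    n = len(student_render)
    return sum((s - t) ** 2 for s, t in zip(student_render, teacher_render)) / n

def distillation_loss(student_renders, teacher_renders):
    """Average photometric loss over a batch of matching (scene, view) render pairs."""
    pairs = list(zip(student_renders, teacher_renders))
    return sum(photometric_loss(s, t) for s, t in pairs) / len(pairs)

# Toy stand-ins for rendered views: 3 views of 12 pixel values each.
rng = random.Random(0)
teacher = [[rng.random() for _ in range(12)] for _ in range(3)]
student_perfect = [list(view) for view in teacher]            # matches the teacher exactly
student_noisy = [[p + 0.1 for p in view] for view in teacher]  # constant offset of 0.1

loss_perfect = distillation_loss(student_perfect, teacher)
loss_noisy = distillation_loss(student_noisy, teacher)
```

In an actual training loop the student renders would come from the NeRF whose weights the hypernetwork predicts, and the teacher renders from the frozen single-scene NeRF for the same prompt and camera pose; only the hypernetwork would receive gradients.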

Paper Structure (27 sections, 5 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: HyperFields is a hypernetwork that learns to map text to the space of weights of Neural Radiance Fields (first column). After training, HyperFields can generate in-distribution scenes (unseen during training) in a feed-forward manner (second column), and for out-of-distribution prompts it can be fine-tuned to yield scenes respecting prompt semantics with just a few gradient steps (third column).
  • Figure 2: Overview. Our training pipeline proceeds in two stages. Stage 1: We train a set of single prompt text-conditioned teacher NeRFs using Score Distillation Sampling. Stage 2: We distill these single scene teacher NeRFs into the hypernetwork, through a photometric loss between the renders of the hypernetwork with the teacher network, which we dub our distillation loss.
  • Figure 3: The input to the HyperFields system is a text prompt, which is encoded by a pre-trained text encoder (frozen BERT model). The text latents are passed to a Transformer module, which outputs a conditioning token (CT). This conditioning token (which supplies scene information) is used to condition each of the MLP modules in the hypernetwork. The first hypernetwork MLP (on the left) predicts the weights $W_1$ of the first layer of the NeRF MLP. The second hypernetwork MLP then takes as input both the CT and $a_1$, which are the activations from the first predicted NeRF MLP layer, and predicts the weights $W_2$ of the second layer of the NeRF MLP. The subsequent scene-conditioned hypernetwork MLPs follow the same pattern, taking the activations $a_{i-1}$ from the previous predicted NeRF MLP layer as input to generate weights $W_{i}$ for the $i^{th}$ layer of the NeRF MLP. We include stop gradients (SG) to stabilize training.
  • Figure 4: Zero-Shot In-Distribution Generalization. We train HyperFields on a 9x8 grid of object/color combination scenes, and hold out a subset of combinations. The faded scenes are in the training set and the bright scenes are the trained model's zero-shot predictions of the holdout set.
  • Figure 5: Finetuning to out-of-distribution prompts: unseen shape and/or unseen attribute. Our method generates out-of-distribution scenes in at most 2k finetuning steps (row 1), whereas the baseline models are far from the desired scene at the same number of iterations (rows 2 and 3). When allowed to fine-tune for significantly longer (rows 4 and 5), the baseline generations are at best comparable to our model's generation quality, demonstrating that our model adapts better to out-of-distribution scenes.
  • ...and 11 more figures
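
The layer-by-layer weight prediction described in the Figure 3 caption can be sketched as a forward pass. This is a minimal illustration under several assumptions, not the authors' architecture: each hypernetwork MLP is reduced to a single random linear map (the hypothetical `HyperLayer`), the conditioning token is a random vector rather than the output of a Transformer over BERT latents, activations come from a single input point, and stop gradients are omitted because nothing is trained here. All dimensions are arbitrary.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

class HyperLayer:
    """Hypothetical hypernetwork head: a linear map from its input to a weight matrix."""
    def __init__(self, in_dim, out_rows, out_cols, rng):
        self.out_rows, self.out_cols = out_rows, out_cols
        # One row of P per entry of the predicted (out_rows x out_cols) matrix.
        self.P = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                  for _ in range(out_rows * out_cols)]

    def __call__(self, x):
        flat = matvec(self.P, x)
        c = self.out_cols
        return [flat[r * c:(r + 1) * c] for r in range(self.out_rows)]

def dynamic_forward(ct, point_features, heads):
    """Predict NeRF MLP weights W_1, W_2, ... one layer at a time.

    W_1 depends on the conditioning token (CT) alone; W_i for i > 1 depends on
    the CT concatenated with a_{i-1}, the activations of the previous predicted layer.
    """
    a = point_features
    weights = []
    for i, head in enumerate(heads):
        hyper_in = ct if i == 0 else ct + a  # list concatenation
        W = head(hyper_in)
        a = relu(matvec(W, a))               # activations of the predicted layer
        weights.append(W)
    return weights, a

rng = random.Random(0)
ct = [rng.gauss(0.0, 1.0) for _ in range(4)]   # stand-in conditioning token
heads = [HyperLayer(4, 5, 3, rng),             # W_1: 5x3, from CT only
         HyperLayer(9, 5, 5, rng),             # W_2: 5x5, from CT + a_1
         HyperLayer(9, 5, 5, rng)]             # W_3: 5x5, from CT + a_2
point = [0.3, -0.2, 0.5]
weights_a, _ = dynamic_forward(ct, point, heads)
weights_b, _ = dynamic_forward(ct, [-v for v in point], heads)
```

With a fixed conditioning token, the first predicted layer is identical for both inputs, while later layers differ because they are conditioned on activations. That dependence on activations is what makes the hypernetwork "dynamic" in this sketch.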