ConRF: Zero-shot Stylization of 3D Scenes with Conditioned Radiation Fields

Xingyu Miao, Yang Bai, Haoran Duan, Fan Wan, Yawen Huang, Yang Long, Yefeng Zheng

TL;DR

ConRF tackles zero-shot stylization of 3D NeRF scenes conditioned on text or image inputs by mapping CLIP features into a VGG-based style space through a learned mapping network and jointly training a shared decoder. It introduces a 3D selection volume for localized style transfer and leverages a weakly supervised framework that aligns CLIP-derived style statistics with VGG-style features, using losses that balance content preservation and stylization. The method renders stylized novel views by integrating global CLIP/VGG guidance with a local 3D feature volume, enabling text-to-style and image-to-style transfers without retraining for new styles. Empirical results on LLFF and Synthetic NeRF show competitive or superior visual quality and view-consistency compared to state-of-the-art baselines, with flexible support for text-text, text-image, and image-image style prompts and detailed ablations validating the contributions.

Abstract

Most existing works on arbitrary 3D NeRF style transfer require retraining for each individual style condition. This work aims to achieve zero-shot controlled stylization of 3D scenes, using text or visual input as conditioning factors. We introduce ConRF, a novel method for zero-shot stylization. Specifically, because CLIP features are ambiguous with respect to style, we employ a conversion process that maps the CLIP feature space to the style space of a pre-trained VGG network, and then refine the CLIP multi-modal knowledge into a style transfer neural radiation field. Additionally, we use a 3D volumetric representation to perform local style transfer. By combining these operations, ConRF can use either text or images as references and generate sequences of novel views with global or local stylization. Our experiments demonstrate that ConRF outperforms existing methods for 3D scene and single-text stylization in terms of visual quality.
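The idea of aligning CLIP-derived style statistics with VGG style features can be illustrated with a small sketch. It is not the authors' implementation: the linear `map_clip_to_style` mapping, the mean-squared `stats_alignment_loss`, and the AdaIN-style `apply_style` step are all assumptions chosen for illustration, and feature maps are plain Python lists of shape C x N rather than real CLIP or VGG outputs.

```python
import math


def channel_stats(feats, eps=1e-5):
    """Per-channel mean and standard deviation of a C x N feature map."""
    means = [sum(ch) / len(ch) for ch in feats]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in ch) / len(ch) + eps)
        for ch, m in zip(feats, means)
    ]
    return means, stds


def map_clip_to_style(clip_vec, weights, bias):
    """Hypothetical linear mapping from a CLIP embedding to style statistics.

    The first half of the output is read as channel means, the second half
    as log standard deviations (exponentiated so the stds stay positive).
    """
    out = [
        sum(w * x for w, x in zip(row, clip_vec)) + b
        for row, b in zip(weights, bias)
    ]
    half = len(out) // 2
    return out[:half], [math.exp(v) for v in out[half:]]


def stats_alignment_loss(pred_means, pred_stds, vgg_feats):
    """Assumed weak-supervision loss: squared error between the mapped
    statistics and the statistics of VGG features of the same style image."""
    tgt_means, tgt_stds = channel_stats(vgg_feats)
    err = sum((p - t) ** 2 for p, t in zip(pred_means, tgt_means))
    err += sum((p - t) ** 2 for p, t in zip(pred_stds, tgt_stds))
    return err / len(tgt_means)


def apply_style(content_feats, style_means, style_stds):
    """AdaIN-style transfer (an assumption): normalize each content channel,
    then rescale and shift it to the target style statistics."""
    c_means, c_stds = channel_stats(content_feats)
    return [
        [(x - cm) / cs * ss + sm for x in ch]
        for ch, cm, cs, sm, ss in zip(
            content_feats, c_means, c_stds, style_means, style_stds
        )
    ]
```

Under these assumptions, training would minimize `stats_alignment_loss` over style images so that the mapping learns to predict VGG-like statistics from CLIP embeddings. At inference, a text or image embedding alone would then supply the statistics passed to `apply_style`, which is what makes stylization with an unseen prompt possible without retraining.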

Paper Structure

This paper contains 30 sections, 21 equations, 16 figures, and 3 tables.

Figures (16)

  • Figure 1: Zero-shot 3D style transfer with a single condition. Given a set of multi-view content images of a 3D scene, ConRF can transfer an arbitrary text reference style or an arbitrary image reference style to the 3D scene in a zero-shot manner.
  • Figure 2: Mitigating the ambiguity of CLIP features via the mapping module. Features from the CLIP extractor capture high-level semantics, so images of the same subject (a sunflower or a Shiba Inu) in different styles yield highly similar distributions that lack fine-level features (e.g., textures). In contrast, VGG features better reveal the differences between images of the same sunflower or Shiba Inu in different styles. To alleviate this problem, we use a mapping module to map CLIP's feature space into a style space. Features in the style space reduce this ambiguity and encourage differentiation among similar feature distributions.
  • Figure 3: The pipeline of ConRF. ConRF performs style transfer on a pre-trained feature NeRF. It consists of two branches, VGG and CLIP, which use the same style transfer module and share a decoder. The VGG branch uses a pre-trained VGG19 (Simonyan and Zisserman, 2014) to extract style features and weakly supervises the CLIP branch, which extracts features with a CLIP image encoder, so that the CLIP branch's mapping module learns to separate style features. Finally, the two branches jointly optimize the decoder to obtain a stylized image. Additionally, to enable local style transfer, we optimize an additional branch of the feature NeRF.
  • Figure 4: The inference of ConRF. After the training phase, ConRF is equipped to apply 3D stylistic transformations directly using text or images. Additionally, users can input specific content selection prompts to control the stylized region.
  • Figure 5: Comparison with four SOTA 3D style transfer methods using reference style images. For the four example scenes, our method produces significantly better 3D style transfer.
  • ...and 11 more figures
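
The local stylization described in the captions for Figures 3 and 4, where a content selection prompt controls the stylized region, can be pictured as a per-point blend between original and stylized features. The sketch below is only an assumption about how such a blend might look: `blend_local_style` and its `selection` weights are hypothetical names, and how the selection volume actually produces those weights is not specified in this summary.

```python
def blend_local_style(original, stylized, selection):
    """Blend per-point feature vectors using selection weights in [0, 1].

    A weight of 1 keeps the stylized feature, 0 keeps the original one.
    Weights outside [0, 1] are clamped. All three inputs are lists with
    one entry per 3D sample point.
    """
    if not (len(original) == len(stylized) == len(selection)):
        raise ValueError("inputs must have one entry per point")
    blended = []
    for orig_vec, style_vec, weight in zip(original, stylized, selection):
        w = min(1.0, max(0.0, weight))
        blended.append(
            [w * s + (1.0 - w) * o for o, s in zip(orig_vec, style_vec)]
        )
    return blended
```

With binary weights this reduces to a hard mask over the scene, while fractional weights give a soft transition at the boundary of the selected region.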