ConRF: Zero-shot Stylization of 3D Scenes with Conditioned Radiation Fields
Xingyu Miao, Yang Bai, Haoran Duan, Fan Wan, Yawen Huang, Yang Long, Yefeng Zheng
TL;DR
ConRF tackles zero-shot stylization of 3D NeRF scenes conditioned on text or image inputs by mapping CLIP features into a VGG-based style space through a learned mapping network and jointly training a shared decoder. It introduces a 3D selection volume for localized style transfer and leverages a weakly supervised framework that aligns CLIP-derived style statistics with VGG-style features, using losses that balance content preservation and stylization. The method renders stylized novel views by integrating global CLIP/VGG guidance with a local 3D feature volume, enabling text-to-style and image-to-style transfers without retraining for new styles. Empirical results on LLFF and Synthetic NeRF show competitive or superior visual quality and view-consistency compared to state-of-the-art baselines, with flexible support for text-text, text-image, and image-image style prompts and detailed ablations validating the contributions.
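The idea of aligning predicted style statistics with VGG-style features can be illustrated with a small sketch. It assumes that "style statistics" means per-channel mean and standard deviation, and that they are applied to content features with an adaptive-instance-normalization-style rescaling; the summary above does not confirm either choice. The function names `channel_stats`, `apply_style_stats`, and `style_stat_loss` are hypothetical and are not taken from the ConRF code.

```python
import math


def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one feature channel given as a flat list of floats."""
    mu = sum(channel) / len(channel)
    var = sum((x - mu) ** 2 for x in channel) / len(channel)
    # eps keeps the std strictly positive for constant channels.
    return mu, math.sqrt(var + eps)


def apply_style_stats(content, style_mean, style_std):
    """Re-normalize each content channel to the given style mean and std.

    Assumption: an AdaIN-style rescaling, used here purely as an illustration.
    `content` is a list of channels; each channel is a list of floats.
    """
    stylized = []
    for channel, m, s in zip(content, style_mean, style_std):
        mu, sigma = channel_stats(channel)
        stylized.append([s * (x - mu) / sigma + m for x in channel])
    return stylized


def style_stat_loss(pred_mean, pred_std, target_mean, target_std):
    """Mean squared error between predicted and target channel statistics.

    Assumption: the predicted statistics would come from a mapping network fed
    with a CLIP embedding, and the targets from VGG features of a style image.
    """
    n = len(pred_mean)
    mean_term = sum((p - t) ** 2 for p, t in zip(pred_mean, target_mean)) / n
    std_term = sum((p - t) ** 2 for p, t in zip(pred_std, target_std)) / n
    return mean_term + std_term
```

In this sketch, a new style only changes the `style_mean` and `style_std` inputs, which is one way to read the claim that no retraining is needed for new styles.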
Abstract
Most existing work on arbitrary 3D NeRF style transfer requires retraining for each style condition. This work aims to achieve zero-shot controlled stylization of 3D scenes, using text or visual input as the conditioning factor. We introduce ConRF, a novel method for zero-shot stylization. Because CLIP features are ambiguous, we employ a conversion process that maps the CLIP feature space to the style space of a pre-trained VGG network and then refines the CLIP multi-modal knowledge into a style transfer neural radiation field. Additionally, we use a 3D volumetric representation to perform local style transfer. By combining these operations, ConRF can use either text or images as references and generates novel-view sequences with global or local stylization. Our experiments demonstrate that ConRF outperforms existing methods for 3D scene and single-text stylization in terms of visual quality.
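Local stylization with a volumetric representation can be pictured as blending stylized and original features per sample before compositing them along a camera ray. The sketch below is a minimal illustration under stated assumptions: it uses the standard NeRF alpha-compositing weights, and it treats the selection as a soft per-sample value in [0, 1]. The names `ray_weights` and `render_locally_stylized` and the blending rule are hypothetical; they are not taken from the paper.

```python
import math


def ray_weights(densities, deltas):
    """Standard volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    weights = []
    transmittance = 1.0  # fraction of light that reaches the current sample
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_locally_stylized(densities, deltas, original, stylized, selection):
    """Composite per-sample features along one ray with a soft selection mask.

    Assumption: `selection[i]` in [0, 1] says how strongly sample i is stylized;
    1 keeps only the stylized feature, 0 keeps only the original feature.
    `original` and `stylized` are lists of equal-length feature vectors.
    """
    weights = ray_weights(densities, deltas)
    dim = len(original[0])
    pixel = [0.0] * dim
    for w, f_orig, f_style, m in zip(weights, original, stylized, selection):
        for k in range(dim):
            pixel[k] += w * (m * f_style[k] + (1.0 - m) * f_orig[k])
    return pixel
```

Setting every selection value to 1 reduces this to global stylization, and setting every value to 0 reproduces the unstylized rendering, so a single rendering path can cover both cases.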
