G3DST: Generalizing 3D Style Transfer with Neural Radiance Fields across Scenes and Styles
Adil Meric, Umut Kocasari, Matthias Nießner, Barbara Roessle
TL;DR
This work addresses the bottleneck of per-scene optimization in NeRF-based 3D style transfer by introducing a generalizable NeRF Transformer augmented with a hypernetwork conditioned on a style latent $z_s$ from a Style-VAE. The model renders view-consistent stylized novel views for unseen scenes and styles at inference time. It is trained with a loss that combines $L_{content}$, $L_{style}$, and a novel multi-view consistency term $L_{consistency}$ based on optical flow between views: $L_{total} = L_{content} + w_s L_{style} + w_c L_{consistency}$, where $w_s$ and $w_c$ weight the style and consistency terms. Key contributions include the first generalizable 3D style transfer across scenes and styles, a flow-based consistency loss that preserves consistency across views, and an efficient, on-the-fly stylization pipeline whose visual quality is comparable to that of per-scene methods at significantly lower cost. The results demonstrate high-quality stylizations with strong multi-view consistency, enabling practical, scene-agnostic 3D style transfer without scene-specific retraining, with potential impact on real-time 3D content creation and editing.
Abstract
Neural Radiance Fields (NeRF) have emerged as a powerful tool for creating highly detailed and photorealistic scenes. Existing methods for NeRF-based 3D style transfer require extensive per-scene optimization for single or multiple styles, which limits their applicability and efficiency. In this work, we overcome these limitations by rendering stylized novel views from a NeRF without per-scene or per-style optimization. To this end, we use a generalizable NeRF model to facilitate style transfer in 3D, so that a single learned model can be applied across various scenes. By incorporating a hypernetwork into a generalizable NeRF, our approach enables on-the-fly generation of stylized novel views. Moreover, we introduce a novel flow-based multi-view consistency loss to preserve consistency across multiple views. We evaluate our method across various scenes and artistic styles and show that it generates high-quality, multi-view consistent stylized images without a scene-specific implicit model. Our findings demonstrate that this approach not only achieves visual quality comparable to that of per-scene methods but also significantly improves efficiency and applicability, marking a notable advancement in the field of 3D style transfer.
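To make the flow-based multi-view consistency idea concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a dense optical flow field from view A to view B is already available, uses nearest-neighbour warping for brevity, and masks only pixels that fall outside the image (the actual method may also handle occlusions and use a different sampling scheme). The function names `warp_nearest`, `consistency_loss`, and `total_loss`, and the default weights, are hypothetical placeholders.

```python
import numpy as np


def warp_nearest(image, flow):
    """Backward-warp `image` (H, W, C) into the reference view.

    flow[y, x] = (dx, dy) maps pixel (x, y) of the reference view to its
    corresponding location in `image`. Nearest-neighbour sampling is an
    assumption made to keep the sketch short.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    # Pixels whose correspondence lands outside the image are ignored.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = image[np.clip(src_y, 0, h - 1), np.clip(src_x, 0, w - 1)]
    return warped, valid


def consistency_loss(stylized_a, stylized_b, flow_a_to_b):
    """Mean squared difference between view A and view B warped into A."""
    warped_b, valid = warp_nearest(stylized_b, flow_a_to_b)
    if not valid.any():
        return 0.0
    diff = (stylized_a - warped_b) ** 2
    return float(diff[valid].mean())


def total_loss(l_content, l_style, l_consistency, w_s=1.0, w_c=1.0):
    """Weighted sum L_content + w_s * L_style + w_c * L_consistency.

    The default weights are placeholders, not values from the paper.
    """
    return l_content + w_s * l_style + w_c * l_consistency
```

In this sketch, two stylized views that agree exactly under the given flow yield a consistency loss of zero, so the term only penalizes stylization that changes between corresponding pixels.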
