Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images

Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada

TL;DR

We propose a simple yet effective pipeline that harnesses 2D image diffusion models to transfer diverse artistic styles to real-world 3D scenes with competitive quality.

Abstract

We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality. Result videos are also available on our project page: https://haruolabs.github.io/style-n2n/

Paper Structure (37 sections, 19 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Overall Pipeline. Our method consists of distinct procedures. We first prepare a NeRF model of the source view images. Given the depth maps of the corresponding views (obtained either by estimation or by rendering from the NeRF), we generate stylized multi-view images using a style-aligned diffusion model. Lastly, we fine-tune the source NeRF on the stylized images using the sliced Wasserstein distance (SWD) loss.
  • Figure 2: Sliced Wasserstein Distance. $p$ and $\hat{p}$ are projected onto a random unit direction $V$ (left). The $1$-dimensional Wasserstein distance can be calculated by taking the $L^2$ difference between the sorted projections of $p$ and $\hat{p}$ (right). The expectation over random directions $V$ is a practical approximation of the $N$-dimensional Wasserstein distance.
  • Figure 3: Style Interpolation. An example of style blending using the Wasserstein barycenter between two different style prompts "A person like Marilyn Monroe, pop art style" and "A person like Steve Jobs".
  • Figure 4: Effect of Style-Alignment. An example of source view conversion applied to the "Bear" scene using the text prompt "A water painting of a brown bear", with and without the shared-attention mechanism within the diffusion pipeline. We find that a fully-shared-attention variant of the style-aligned diffusion model (Hertz et al., 2023) greatly improves style consistency among generated views.
  • Figure 5: Baseline Comparisons. We compare our method against several variants. The images show an example comparison of the "Bear" scene trained from a style description "A water painting of a brown bear" with a text guidance scale of $7.5$. Note that (b), (c), (e), and (f) are all novel view renders from NeRF. NeRF renderings from (f) ours preserve the original content in (a) without noticeable artifacts compared to (c) Train-from-Scratch and (e) Style-Alignment w/RGB Loss, and also maintain style and color similar to the 2D reference (d). Unlike ours, No Style-Alignment (b) fails to preserve consistent scene color. We encourage our readers to check the results in the video.
  • ...and 5 more figures
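
The computation described in the Figure 2 caption (project both point sets onto random unit directions, sort the projections, compare them, and average over directions) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `sliced_wasserstein`, its parameters, the use of plain Python lists in place of CNN feature maps, and the choice to average the squared differences are all assumptions made for the example.

```python
import math
import random


def sliced_wasserstein(p, p_hat, num_projections=64, seed=0):
    """Monte Carlo estimate of a squared sliced Wasserstein distance.

    p, p_hat: equally sized lists of N-dimensional feature vectors
    (hypothetical stand-ins for features from a pre-trained CNN).
    """
    if not p or len(p) != len(p_hat):
        raise ValueError("p and p_hat must be non-empty and the same size")
    dim = len(p[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        # Draw a random unit direction V by normalizing a Gaussian sample.
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        # Project every feature vector onto V and sort the 1-D projections.
        proj_p = sorted(sum(a * b for a, b in zip(x, v)) for x in p)
        proj_q = sorted(sum(a * b for a, b in zip(x, v)) for x in p_hat)
        # In 1-D, matching sorted samples gives the optimal transport plan,
        # so the mean squared gap is the squared 1-D Wasserstein distance.
        total += sum((a - b) ** 2 for a, b in zip(proj_p, proj_q)) / len(p)
    # Average over directions to approximate the expectation over V.
    return total / num_projections
```

Because the projections are sorted before comparison, the estimate does not depend on the order of the input vectors: a shuffled copy of the same set yields zero. In a NeRF fine-tuning setting, the same steps would presumably run on differentiable tensors so that gradients can flow back to the rendered image; that part is not shown here.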