MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field

Zijiang Yang, Zhongwei Qiu, Chang Xu, Dongmei Fu

TL;DR

MM-NeRF, a novel method for multimodal-guided 3D multi-style transfer of NeRF, achieves high-quality 3D multi-style stylization with multimodal guidance while keeping both multi-view consistency and style consistency across the guidance modalities.

Abstract

3D style transfer aims to generate stylized views of a 3D scene in a specified style, which requires both high-quality rendering and multi-view consistency. Existing methods still struggle to produce high-quality stylization with texture details and to support multimodal guidance. In this paper, we reveal that the common training method for NeRF stylization, which generates stylized multi-view supervision with 2D style transfer models, causes the same object to appear in different states (color tone, details, etc.) across the supervision views. This leads NeRF to smooth out texture details and results in low-quality rendering for 3D multi-style transfer. To tackle these problems, we propose a novel Multimodal-guided 3D Multi-style transfer of NeRF, termed MM-NeRF. First, MM-NeRF projects multimodal guidance into a unified space to keep the multimodal styles consistent and extracts multimodal features to guide the 3D stylization. Second, a novel multi-head learning scheme is proposed to ease the difficulty of learning multi-style transfer, and a multi-view style consistent loss is proposed to address the inconsistency of the multi-view supervision data. Finally, a novel incremental learning mechanism is proposed to generalize MM-NeRF to any new style at low cost. Extensive experiments on several real-world datasets show that MM-NeRF achieves high-quality 3D multi-style stylization with multimodal guidance while keeping both multi-view consistency and style consistency across the guidance modalities.
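
The smoothing effect described in the abstract can be illustrated with a toy example. The sketch below is not the paper's training setup: it assumes a squared-error loss and a view-independent prediction, and it models the inconsistent 2D stylization as the same stripe pattern drawn with a different random phase in each view (all names and sizes are hypothetical). Under those assumptions the best fit at each point is the mean over views, which has less contrast than any single supervision view.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 256 surface points observed from 8 views.
n_views, n_points = 8, 256
x = np.linspace(0.0, 2.0 * np.pi, n_points)

# ASSUMPTION: each view's 2D stylization paints the same stripes,
# but with a different random phase (a stand-in for inconsistent details).
phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_views, 1))
supervision = 0.5 + 0.5 * np.sin(8.0 * x[None, :] + phases)

# A view-independent prediction that minimizes squared error
# is the per-point mean over all views.
best_fit = supervision.mean(axis=0)


def mse(pred):
    """Mean squared error of one shared prediction against all views."""
    return float(((supervision - pred[None, :]) ** 2).mean())


# Contrast (standard deviation of intensity) before and after fitting.
per_view_contrast = float(supervision.std(axis=1).mean())
fit_contrast = float(best_fit.std())

print("mean contrast of supervision views:", round(per_view_contrast, 3))
print("contrast of the squared-error fit: ", round(fit_contrast, 3))
print("MSE of the fit:", round(mse(best_fit), 4))
print("MSE when copying view 0:", round(mse(supervision[0]), 4))
```

The fit has the lowest error of any shared prediction, yet its stripes are flatter than those in each individual view, which is the loss of texture detail the abstract attributes to inconsistent supervision.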

Paper Structure

This paper contains 59 sections, 12 equations, 23 figures, 8 tables, 1 algorithm.

Figures (23)

  • Figure 1: (a) HyperNeRF (Chiang et al., 2022), driven by a single-modality style (image), generates blurry results that lack structural details. (b) MM-NeRF generates higher-quality stylized novel views with multimodal guidance (image and text).
  • Figure 2: Stylized training supervision generated by 2D style transfer methods. As shown in the zoomed-in patch, the local details are stylized inconsistently across views. In addition, there is a large gap between the two styles, which differ in tone and strokes.
  • Figure 3: Comparison between common NeRF and AdaIN (Huang and Belongie, 2017). The inconsistency of the multi-view supervision makes NeRF generate stylized novel views with blurred details.
  • Figure 4: The framework of MM-NeRF. To address the three challenges, we propose a Multi-view Style Consistent Loss $\mathcal{L}_{MSCL}$, a Multi-head Learning Scheme (MLS), and a novel cross-modal feature correction module $\mathcal{F}(\cdot)$, respectively. Multimodal styles are first encoded into the CLIP (Radford et al., 2021) space to generate style features, and $\mathcal{F}(\cdot)$ predicts a correction vector $\Delta f_s^{(1)}$ for the text style feature $f_{s_t}^{(1)}$ to ensure style consistency of the multimodal guidance. MLS uses a shared backbone to learn common features and an independent prediction head to predict parameters from the style feature. By replacing the color head and the opacity head of the NeRF with the predicted parameters, MM-NeRF renders high-quality stylized images by ray marching. $\oplus$ denotes summation, and AdaIN (Huang and Belongie, 2017) is the 2D stylization method.
  • Figure 5: (a) Illustration of the inconsistency between multimodal guidance. A paired text $s_t$ and image $s_i$ are mapped to $f_{s_t}$ and $f_{s_i}$, resulting in mismatched stylization. MM-NeRF addresses this problem by introducing a cross-modal feature correction module that predicts the feature correction vector $\Delta f_s$ and corrects the text feature from $f_{s_t}$ to $f_{s_t}^\prime$. The distance between the features of the text-image pair is reduced from $d_1$ to $d_2$. (b) Histogram of the CLIP similarity of text-image pairs before and after correction.
  • ...and 18 more figures
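
Two operations named in the captions of Figures 4 and 5 can be sketched in a few lines of NumPy. The `adain` function follows the standard AdaIN definition (match the per-channel mean and standard deviation of the content features to those of the style features). The feature correction part is only an illustration: the paper predicts $\Delta f_s$ with a learned module $\mathcal{F}(\cdot)$, whereas here `delta` is a hand-made stand-in that moves the text feature halfway toward the image feature. The random vectors, the 512-dimensional size, and all variable names are assumptions, not values from the paper.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Adaptive instance normalization on (channels, height, width) arrays.

    Each content channel is normalized, then rescaled and shifted to the
    mean and standard deviation of the matching style channel.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean


def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)

# AdaIN on random stand-in feature maps (spatial sizes may differ).
content = rng.normal(0.0, 1.0, size=(4, 16, 16))
style = rng.normal(3.0, 2.0, size=(4, 8, 8))
stylized = adain(content, style)

# Feature correction: f_text' = f_text + delta.
# ASSUMPTION: random vectors stand in for CLIP text and image features,
# and delta is hand-made here instead of being predicted by a network.
f_text = rng.normal(size=512)
f_image = rng.normal(size=512)
delta = 0.5 * (f_image - f_text)
f_text_corrected = f_text + delta

sim_before = cosine(f_text, f_image)
sim_after = cosine(f_text_corrected, f_image)
print("similarity before correction:", round(sim_before, 3))
print("similarity after correction: ", round(sim_after, 3))
```

After `adain`, each channel of `stylized` carries the statistics of the style features, and the corrected text feature is more similar to the image feature than the original, mirroring the reduction from $d_1$ to $d_2$ in Figure 5.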